DATA MINING UNIT-II NOTES

Association Rule Mining focuses on identifying frequent patterns and correlations in large datasets, which are essential for tasks like data classification and clustering. Market Basket Analysis is a key application that helps retailers understand customer purchasing habits, enabling effective marketing strategies. The Apriori algorithm is a widely used method for mining frequent itemsets, employing a two-step process of candidate generation and support counting.


Data Mining

Unit-II
Association Rule Mining aims to find association rules, frequent patterns, subsequences, or
correlation relationships among large sets of data items that satisfy predefined minimum
support and confidence thresholds in a given database.

Mining Frequent Patterns, Associations, and Correlations:
Frequent patterns are patterns (e.g., itemsets, subsequences, or substructures) that
appear frequently in a data set. For example, a set of items, such as milk and bread,
that appear frequently together in a transaction data set is a frequent itemset.
A subsequence, such as buying first a PC, then a digital camera, and then a memory
card, if it occurs frequently in a shopping history database, is a (frequent) sequential
pattern.
A substructure can refer to different structural forms, such as subgraphs, subtrees,
or sublattices, which may be combined with itemsets or subsequences. If a
substructure occurs frequently, it is called a (frequent) structured pattern.
Finding frequent patterns plays an essential role in mining associations, correlations,
and many other interesting relationships among data. Moreover, it helps in data
classification, clustering, and other data mining tasks. Thus, frequent pattern mining
has become an important data mining task and a focused theme in data mining
research.
Applications of Association rule mining:
Basket data analysis, cross-marketing, catalog design, loss-leader analysis,
clustering, classification, etc.



Market Basket Analysis: A Motivating Example
Frequent itemset mining leads to the discovery of associations and correlations among
items in large transactional or relational datasets. With massive amounts of data
continuously being collected and stored, many industries are becoming interested in
mining such patterns from their databases. The discovery of interesting correlation
relationships among huge amounts of business transaction records can help in many
business decision-making processes such as catalog design, cross-marketing, and
customer shopping behavior analysis.

A typical example of frequent itemset mining is market basket analysis. This process
analyzes customer buying habits by finding associations between the different items
that customers place in their “shopping baskets”. The discovery of these associations
can help retailers develop marketing strategies by gaining insight into which items are
frequently purchased together by customers.



Example: Market basket analysis. Suppose, as manager of an AllElectronics branch,
you would like to learn more about the buying habits of your customers. Specifically,
you wonder, “Which groups or sets of items are customers likely to purchase on a
given trip to the store?” To answer your question, market basket analysis may be
performed on the retail data of customer transactions at your store. You can then use
the results to plan marketing or advertising strategies, or in the design of a new
catalog. For instance, market basket analysis may help you design different store
layouts. In one strategy, items that are frequently purchased together can be placed in
proximity to further encourage the combined sale of such items. If customers who
purchase computers also tend to buy antivirus software at the same time, then
placing the hardware display close to the software display may help increase the sales
of both items.

Support and Confidence Measures:


Rule support and confidence are two measures of rule interestingness. They
respectively reflect the usefulness and certainty of discovered rules. Typically,
association rules are considered interesting if they satisfy both a minimum support
threshold and a minimum confidence threshold. These thresholds can be set by
users or domain experts. Additional analysis can be performed to discover interesting
statistical correlations between associated items.
Support:
The rule A⇒B holds in the transaction set D with support s, where s is the percentage
of transactions in D that contain A∪B (i.e., the union of sets A and B say, or, both A
and B). This is taken to be the probability, P(A∪B).
Support(A⇒B) = P(A∪B) = support_count(A∪B) / N
where N is the total number of transactions in the transactional database.
Confidence:
The rule A⇒B has confidence c in the transaction set D, where c is the percentage of
transactions in D containing A that also contain B. This is taken to be the conditional
probability, P(B|A). That is
Confidence(A⇒B) = P(B|A) = support_count(A∪B) / support_count(A)



where support_count denotes the occurrence frequency of an itemset, i.e., the number of
transactions in the database that contain it.
A set of items is referred to as an itemset. An itemset that contains k items is a k-
itemset. The set {computer, antivirus software} is a 2-itemset. The occurrence
frequency of an itemset is the number of transactions that contain the itemset. This is
also known, simply, as the frequency, support count, or count of the itemset.
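As an illustration, the following minimal Python sketch computes the support and confidence of a candidate rule directly from these definitions. The toy transaction list and item names are hypothetical, not part of the original notes.

# Minimal sketch: support and confidence of a rule A => B
# over a small, hypothetical transaction database.

def support_count(itemset, transactions):
    # Number of transactions that contain every item in `itemset`.
    return sum(1 for t in transactions if itemset.issubset(t))

transactions = [            # hypothetical toy database, |D| = 5
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "eggs"},
    {"milk", "bread", "eggs"},
    {"milk", "eggs"},
]

A, B = {"milk"}, {"bread"}
N = len(transactions)

support = support_count(A | B, transactions) / N                                   # P(A ∪ B)
confidence = support_count(A | B, transactions) / support_count(A, transactions)   # P(B | A)

print(f"support    = {support:.2f}")     # 3/5 = 0.60
print(f"confidence = {confidence:.2f}")  # 3/4 = 0.75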
In general, association rule mining can be viewed as a two-step process:
1. Find all frequent itemsets: By definition, each of these itemsets will occur at least
as frequently as a predetermined minimum support count, min_sup.
2. Generate strong association rules from the frequent itemsets: By definition,
these rules must satisfy minimum support and minimum confidence.
Frequent Itemset Mining Methods:
There are two mining methods for frequent itemsets:
1. Apriori algorithm (with candidate generation)
2. Frequent pattern growth, FP-growth (without candidate generation)

Apriori Algorithm: Finding Frequent Itemsets by Confined Candidate Generation:

Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for


mining frequent itemsets for Boolean association rules. The name of the algorithm is
based on the fact that the algorithm uses prior knowledge of frequent itemset
properties. Apriori employs an iterative approach known as a level-wise search, where
k-itemsets are used to explore (k+1)-itemsets.
First, the set of frequent 1-itemsets is found by scanning the database to accumulate
the count for each item, and collecting those items that satisfy minimum support. The
resulting set is denoted by L1. Next, L1 is used to find L2, the set of frequent 2-
itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can
be found. The finding of each Lk requires one full scan of the database.
Apriori property: All nonempty subsets of a frequent itemset must also be frequent.
The Apriori property is based on the following observation. By definition, if an itemset
I does not satisfy the minimum support threshold, min_sup, then I is not frequent,
that is, P(I) < min_sup. If an item A is added to the itemset I, then the resulting



itemset (i.e., I∪A) cannot occur more frequently than I. Therefore, I∪A is not frequent
either, that is, P(I∪A) < min_sup.
Each Apriori iteration follows a two-step process, consisting of join and prune actions
(a short sketch of this candidate-generation step is given after the list below).
1. The join step: To find Lk, a set of candidate k-itemsets is generated by joining
Lk−1 with itself. This set of candidates is denoted Ck. Two itemsets l1 and l2 of Lk−1
are joinable if their first (k−2) items are identical and only their last items differ.
2. The prune step: Ck is a superset of Lk, that is, its members may or may not be
frequent, but all of the frequent k-itemsets are included in Ck. A database scan to
determine the count of each candidate in Ck would result in the determination of Lk
(i.e., all candidates having a count no less than the minimum support count are
frequent by definition, and therefore belong to Lk).
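The join-and-prune candidate generation can be sketched in Python as follows. This is a minimal sketch, not the book's reference implementation: the join is simplified to a pairwise union, and the prune step removes the extra candidates this simplification produces; the example L2 values come from the worked example below.

from itertools import combinations

def apriori_gen(L_prev, k):
    # L_prev: set of frozensets of size k-1 (the frequent (k-1)-itemsets).
    # Join step (simplified): union every pair of (k-1)-itemsets whose union has
    # exactly k items. The textbook join only merges pairs sharing their first
    # k-2 items; the prune step below removes the extra candidates this produces.
    candidates = {a | b for a in L_prev for b in L_prev if len(a | b) == k}
    # Prune step: drop any candidate with an infrequent (k-1)-subset
    # (Apriori property: all subsets of a frequent itemset must be frequent).
    return {c for c in candidates
            if all(frozenset(s) in L_prev for s in combinations(c, k - 1))}

# Example: L2 from the AllElectronics-style example below
L2 = {frozenset(p) for p in [("I1", "I2"), ("I1", "I3"), ("I1", "I5"),
                             ("I2", "I3"), ("I2", "I4"), ("I2", "I5")]}
print(apriori_gen(L2, 3))   # -> {I1,I2,I3} and {I1,I2,I5}, matching C3 in the example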
Example Apriori. Let’s look at a concrete example, based on the AllElectronics
transaction database, D, of Table 6.1. There are nine transactions in this database,
that is, |D|=9. We use Figure 6.2 to illustrate the Apriori algorithm for finding
frequent itemsets in D.



1. In the first iteration of the algorithm, each item is a member of the set of candidate
1-itemsets, C1. The algorithm simply scans all of the transactions to count the
number of occurrences of each item.
2. Suppose that the minimum support count required is 2, that is, min_sup = 2. The set
of frequent 1-itemsets, L1, can then be determined. It consists of the candidate 1-
itemsets satisfying minimum support. In our example, all of the candidates in C1
satisfy minimum support.
3. To discover the set of frequent 2-itemsets, L2, the algorithm uses the join L1 ⋈ L1 to
generate a candidate set of 2-itemsets, C2. C2 consists of 2-itemsets. Note that no
candidates are removed from C2 during the prune step because each subset of the
candidates is also frequent.



4. Next, the transactions in D are scanned and the support count of each candidate
itemset in C2 is accumulated, as shown in the middle table of the second row in
Figure.
5. The set of frequent 2-itemsets, L2, is then determined, consisting of those
candidate 2-itemsets in C2 having minimum support.
6. The generation of the set of the candidate 3-itemsets, C3, is detailed in Figure 6.3.
From the join step, we first get C3 = L2 ⋈ L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3,
I4}, {I2, I3, I5}, {I2, I4, I5}}. Based on the Apriori property that all subsets of a frequent
itemset must also be frequent, we can determine that the four latter candidates
cannot possibly be frequent. We therefore remove them from C3, thereby saving the
effort of unnecessarily obtaining their counts during the subsequent scan of D to
determine L3. Note that when given a candidate k-itemset, we only need to check if its
(k−1)-subsets are frequent since the Apriori algorithm uses a level-wise search
strategy. The resulting pruned version of C3 is shown in the first table of the bottom
row of Figure.
7. The transactions in D are scanned to determine L3, consisting of those candidate
3-itemsets in C3 having minimum support (Figure 6.2).
8. The algorithm uses L3 ⋈ L3 to generate a candidate set of 4-itemsets, C4. Although
the join results in {{I1, I2, I3, I5}},
itemset {I1, I2, I3, I5} is pruned because its subset {I2,I3,I5} is not frequent. Thus, C4
=φ, and the algorithm terminates, having found all of the frequent itemsets.
Generating association rules: Let’s try an example based on the transactional data
for AllElectronics shown before in Table. The data contain frequent itemset X ={I1, I2,
I5}. What are the association rules that can be generated from X? The nonempty
subsets of X are {I1, I2},{I1, I5},{I2, I5},{I1},{I2}, and {I5}. The resulting association rules
are as shown below, each listed with its confidence:
{I1,I2}⇒I5, confidence=2/4=50%
{I1,I5}⇒I2, confidence=2/2=100%
{I2,I5}⇒I1, confidence=2/2=100%
I1⇒{I2,I5}, confidence=2/6=33%
I2⇒{I1,I5}, confidence=2/7=29%
I5⇒{I1,I2}, confidence=2/2=100%



If the minimum confidence threshold is, say, 70%, then only the second, third, and
last rules are output, because these are the only ones generated that are strong. Note
that, unlike conventional classification rules, association rules can contain more than
one conjunct in the right side of the rule.
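A minimal sketch of this rule-generation step in Python is given below. The support counts are the ones implied by the worked example (the denominators and numerators of the confidence values above); the function and variable names are illustrative.

from itertools import combinations

# Support counts of the relevant itemsets in the worked example (|D| = 9).
support_count = {
    frozenset({"I1"}): 6,
    frozenset({"I2"}): 7,
    frozenset({"I5"}): 2,
    frozenset({"I1", "I2"}): 4,
    frozenset({"I1", "I5"}): 2,
    frozenset({"I2", "I5"}): 2,
    frozenset({"I1", "I2", "I5"}): 2,
}

def generate_rules(frequent_itemset, min_conf):
    # Emit all strong rules A => (X - A) from a frequent itemset X.
    X = frozenset(frequent_itemset)
    rules = []
    for r in range(1, len(X)):                      # every nonempty proper subset A of X
        for antecedent in map(frozenset, combinations(X, r)):
            consequent = X - antecedent
            conf = support_count[X] / support_count[antecedent]
            if conf >= min_conf:
                rules.append((set(antecedent), set(consequent), conf))
    return rules

for a, c, conf in generate_rules({"I1", "I2", "I5"}, min_conf=0.70):
    print(f"{a} => {c}  (confidence = {conf:.0%})")
# Prints the three strong rules: {I1,I5}=>{I2}, {I2,I5}=>{I1}, {I5}=>{I1,I2}, each 100%.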
Algorithm: Apriori. Find frequent itemsets using an iterative level-wise approach
based on candidate generation.
Input:
 D, a database of transactions;
 Min_sup, the minimum support count threshold.
Output: L, frequent itemsets in D.
Method:
(1) L1 = find_frequent_1-itemsets(D);
(2) for (k = 2; Lk−1 ≠ ∅; k++) {
(3)   Ck = apriori_gen(Lk−1);
(4)   for each transaction t ∈ D {   // scan D for counts
(5)     Ct = subset(Ck, t);          // get the subsets of t that are candidates
(6)     for each candidate c ∈ Ct
(7)       c.count++;
(8)   }
(9)   Lk = {c ∈ Ck | c.count ≥ min_sup};
(10) }
(11) return L = ∪k Lk;
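A compact, runnable Python sketch of this level-wise procedure follows. It is a minimal sketch under the assumption that the database fits in memory; the nine transactions are reconstructed so that their item counts match the support counts quoted in the worked example (I2:7, I1:6, I3:6, I4:2, I5:2).

from itertools import combinations
from collections import defaultdict

def apriori(transactions, min_sup):
    # Return a dict {frozenset(itemset): support_count} of all frequent itemsets.
    transactions = [frozenset(t) for t in transactions]

    # L1: frequent 1-itemsets.
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[frozenset([item])] += 1
    frequent = {i: c for i, c in counts.items() if c >= min_sup}
    all_frequent = dict(frequent)

    k = 2
    while frequent:
        # Candidate generation: join Lk-1 with itself, then prune by the Apriori property.
        prev = set(frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}

        # Support counting: one scan of the database per level.
        counts = defaultdict(int)
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1

        frequent = {c: n for c, n in counts.items() if n >= min_sup}
        all_frequent.update(frequent)
        k += 1

    return all_frequent

# Illustrative usage with min_sup = 2:
D = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
     {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"}]
for itemset, count in sorted(apriori(D, 2).items(), key=lambda x: (len(x[0]), sorted(x[0]))):
    print(sorted(itemset), count)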
Improving the Efficiency of Apriori: Many variations of the Apriori algorithm have
been proposed that focus on improving the efficiency of the original algorithm. Several
of these variations are summarized as follows:
Hash-based technique (hashing itemsets into corresponding buckets): A hash-
based technique can be used to reduce the size of the candidate k-itemsets, Ck, for k
> 1. For example, when scanning each transaction in the database to generate the
frequent 1-itemsets, L1, we can generate all the 2-itemsets for each transaction, hash
(i.e., map) them into the different buckets of a hash table structure, and increase the
corresponding bucket counts (Figure 6.5). A 2-itemset with a corresponding bucket
count in the hash table that is below the support threshold cannot be frequent and



thus should be removed from the candidate set. Such a hash-based technique may
substantially reduce the number of candidate k-itemsets examined (especially when
k=2).
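A minimal sketch of this idea follows. The bucket count, the hash function, and the transactions are illustrative choices, not the ones used in the original hash-based (DHP-style) proposal.

from itertools import combinations
from collections import defaultdict

NUM_BUCKETS = 7      # illustrative choice; in practice it depends on the memory budget
min_sup = 2

transactions = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"}]

def bucket(pair):
    # Simple deterministic hash function, for illustration only.
    return sum(ord(ch) for item in pair for ch in item) % NUM_BUCKETS

# While scanning the database for 1-itemset counts, also hash every 2-itemset of
# each transaction into a bucket and increment that bucket's count.
bucket_count = defaultdict(int)
for t in transactions:
    for pair in combinations(sorted(t), 2):
        bucket_count[bucket(pair)] += 1

def may_be_frequent(pair):
    # A 2-itemset whose bucket count is already below min_sup cannot be frequent,
    # so it can be excluded from the candidate set C2.
    return bucket_count[bucket(tuple(sorted(pair)))] >= min_sup

print(may_be_frequent(("I1", "I2")))   # True: its bucket count is at least min_sup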

Transaction reduction (reducing the number of transactions scanned in future
iterations): A transaction that does not contain any frequent k-itemsets cannot
contain any frequent (k+1)-itemsets. Therefore, such a transaction can be marked or
removed from further consideration because subsequent database scans for j-
itemsets, where j > k, will not need to consider such a transaction.
Partitioning (partitioning the data to find candidate itemsets):
A partitioning technique can be used that requires just two database scans to mine
the frequent itemsets (Figure). It consists of two phases. In phase I, the algorithm
divides the transactions of D into n non overlapping partitions. If the minimum
relative support threshold for transactions in D is min_sup, then the minimum
support count for a partition is min_sup×the number of transactions in that partition.
For each partition, all the local frequent itemsets (i.e.,the itemsets frequent within the
partition) are found.
A local frequent itemset may or may not be frequent with respect to the entire
database, D. However, any itemset that is potentially frequent with respect to D must
occur as a frequent itemset in at least one of the partitions. Therefore, all local
frequent itemsets are candidate itemsets with respect to D. The collection of frequent
itemsets from all partitions forms the global candidate itemsets with respect to D. In
phase II, a second scan of D is conducted in which the actual support of each



candidate is assessed to determine the global frequent itemsets. Partition size and the
number of partitions are set so that each partition can fit into main memory and
therefore be read only once in each phase.
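A minimal sketch of the two-phase partitioning idea follows. The partitioning scheme and the brute-force local mining helper are simplified for illustration and are not part of the original notes.

from collections import defaultdict
from itertools import combinations

def local_frequent_itemsets(partition, min_rel_sup, max_size=3):
    # Phase I helper: brute-force local frequent itemsets inside one partition.
    local_min_count = min_rel_sup * len(partition)   # min_sup × partition size
    counts = defaultdict(int)
    for t in partition:
        for k in range(1, max_size + 1):
            for c in combinations(sorted(t), k):
                counts[frozenset(c)] += 1
    return {i for i, n in counts.items() if n >= local_min_count}

def partition_mine(D, min_rel_sup, n_partitions=3):
    # Phase I: mine each partition in memory; the union of the local results
    # forms the global candidate set.
    size = max(1, len(D) // n_partitions)
    candidates = set()
    for start in range(0, len(D), size):
        candidates |= local_frequent_itemsets(D[start:start + size], min_rel_sup)

    # Phase II: one full scan of D to count the actual support of every candidate.
    counts = defaultdict(int)
    for t in D:
        for c in candidates:
            if c <= frozenset(t):
                counts[c] += 1
    return {c: n for c, n in counts.items() if n >= min_rel_sup * len(D)}

# Usage (illustrative): partition_mine(list_of_transaction_sets, min_rel_sup=0.5)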

Sampling (mining on a subset of the given data): The basic idea of the sampling
approach is to pick a random sample S of the given data D, and then search for
frequent itemsets in S instead of D. In this way, we trade off some degree of accuracy
against efficiency. The sample size of S is chosen such that the search for frequent itemsets in S
can be done in main memory, and so only one scan of the transactions in S is
required overall. Because we are searching for frequent itemsets in S rather than in D,
it is possible that we will miss some of the global frequent itemsets.
Dynamic itemset counting (adding candidate itemsets at different points during
a scan): A dynamic itemset counting technique was proposed in which the database is
partitioned into blocks marked by start points. In this variation, new candidate
itemsets can be added at any start point, unlike in Apriori, which determines new
candidate itemsets only immediately before each complete database scan. The
technique uses the count-so-far as the lower bound of the actual count. If the count-
so-far passes the minimum support, the itemset is added into the frequent itemset
collection and can be used to generate longer candidates. This leads to fewer database
scans than with Apriori for finding all the frequent itemsets.



A Pattern-Growth Approach for Mining Frequent Itemsets:
As we have seen, in many cases the Apriori candidate generate-and-test method
significantly reduces the size of candidate sets, leading to good performance gain.
However, it can suffer from two nontrivial costs:
 It may still need to generate a huge number of candidate sets. For example,
if there are 10^4 frequent 1-itemsets, the Apriori algorithm will need to generate
more than 10^7 candidate 2-itemsets.
 It may need to repeatedly scan the whole database and check a large set of
candidates by pattern matching. It is costly to go over each transaction in the
database to determine the support of the candidate itemsets.
“Can we design a method that mines the complete set of frequent itemsets without
such a costly candidate generation process?” An interesting method in this attempt is
called frequent pattern growth, or simply FP-growth, which adopts a divide-and-
conquer strategy as follows. First, it compresses the database representing frequent
items into a frequent pattern tree, or FP-tree, which retains the itemset association
information. It then divides the compressed database into a set of conditional
databases (a special kind of projected database), each associated with one frequent
item or “pattern fragment,” and mines each database separately. For each “pattern
fragment,” only its associated datasets need to be examined. Therefore, this approach
may substantially reduce the size of the data sets to be searched, along with the
“growth” of patterns being examined.
Example: FP-growth (finding frequent itemsets without candidate generation) We
reexamine the mining of the transaction database D from the Apriori example using
the frequent pattern growth approach.
The first scan of the database is the same as Apriori, which derives the set of
frequent items (1-itemsets) and their support counts (frequencies). Let the minimum
support count be 2. The set of frequent items is sorted in the order of descending
support count. This resulting set or list is denoted by L. Thus, we have L={{I2: 7}, {I1:
6}, {I3: 6}, {I4: 2},{I5: 2}}.
An FP-tree is then constructed as follows. First, create the root of the tree,
labeled with “null.” Scan database D a second time. The items in each transaction are



processed in L order (i.e., sorted according to descending support count), and a
branch is created for each transaction.

We first consider I5, which is the last item in L, rather than the first. The reason for
starting at the end of the list will become apparent as we explain the FP-tree mining
process. I5 occurs in two FP-tree branches of Figure.



The FP-growth method transforms the problem of finding long frequent patterns into
searching for shorter ones in much smaller conditional databases recursively and
then concatenating the suffix. It uses the least frequent items as a suffix, offering good
selectivity. The method substantially reduces the search costs.
When the database is large, it is sometimes unrealistic to construct a main
memory based FP-tree. An interesting alternative is to first partition the database into
a set of projected databases, and then construct an FP-tree and mine it in each
projected database.
This process can be recursively applied to any projected database if its FP-tree
still cannot fit in main memory. A study of the FP-growth method performance shows
that it is efficient and scalable for mining both long and short frequent patterns, and
is about an order of magnitude faster than the Apriori algorithm.
Algorithm: FP growth. Mine frequent itemsets using an FP-tree by pattern fragment
growth.
Input:
 D, a transaction database;
 Min_sup, the minimum support count threshold.
Output: The complete set of frequent patterns.
Method:
1. The FP-tree is constructed in the following steps:
(a) Scan the transaction database D once. Collect F, the set of frequent items,
and their support counts. Sort F in support count descending order as L, the
list of frequent items.
(b) Create the root of an FP-tree, and label it as “null.” For each transaction
Trans in D do the following.
Select and sort the frequent items in Trans according to the order of L. Let the
sorted frequent item list in Trans be [p|P], where p is the first element and P is
the remaining list. Call insert_tree([p|P], T), which is performed as follows. If T
has a child N such that N.item-name = p.item-name, then increment N’s count
by 1; else create a new node N, let its count be 1, link its parent to T, and link it
via the node-link structure to the nodes with the same item-name. If P is
nonempty, call insert_tree(P, N) recursively.



2. The FP-tree is mined by calling FP growth(FP tree, null), which is
implemented as follows.
Procedure FP growth(Tree, α)
(1) if Tree contains a single path P then
(2) for each combination (denoted as β) of the nodes in the path P
(3) generate pattern β∪α with support_count = minimum support_count of nodes in β;

(4) else for each ai in the header of Tree {


(5) generate pattern β = ai ∪ α with support_count =ai.support_count;
(6) construct β’s conditional pattern base and then β’s conditional FP tree Treeβ;
(7) if Treeβ ≠ ∅ then
(8) call FP growth(Treeβ, β); }
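A minimal Python sketch of the FP-tree construction step follows. The header table with node-links and the recursive mining of conditional FP-trees are omitted to keep it short; the class and function names, and the tiny database used at the end, are illustrative assumptions.

from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 1
        self.children = {}

def build_fp_tree(transactions, min_sup):
    # Construct an FP-tree with two scans of the database, as in the algorithm above.
    # Scan 1: collect the frequent items and sort them into the list L (descending count).
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    L = [i for i, c in sorted(counts.items(), key=lambda x: -x[1]) if c >= min_sup]
    order = {item: rank for rank, item in enumerate(L)}

    # Scan 2: insert each transaction's frequent items, in L order, into the tree.
    root = FPNode(None, None)                       # root labelled "null"
    for t in transactions:
        items = sorted((i for i in t if i in order), key=order.get)
        node = root
        for item in items:
            if item in node.children:
                node.children[item].count += 1      # shared prefix: increment count
            else:
                node.children[item] = FPNode(item, node)
            node = node.children[item]
    return root, L

# Illustrative usage with min_sup = 2 on a tiny hypothetical database:
tree, L = build_fp_tree([{"I2", "I1", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I3"}], 2)
print(L)   # frequent items in descending support-count order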

Mining Closed and Max Patterns:


Frequent itemset: an itemset is called frequent if its support is greater than or equal
to an agreed minimum support value.

1-itemset: an itemset that contains exactly one item.

2-itemset: an itemset that contains exactly two items.

Given Min_sup=50%
Support(A) = 3/4 =75%
Support(B) = 1/4 =25%
Support(C) = 2/4 =50%
Support(D) = 2/4 =50%
Support(E) = 2/4 =50%

So items A, C, D, and E satisfy the 50% min_sup, and these items are therefore frequent.
There are two such representations
1. Maximal Frequent Itemsets:
 An itemset is maximal frequent if none of its immediate supersets is
frequent.
2. Closed Frequent Itemsets:
 An itemset is closed if none of its immediate supersets has exactly the same
support count.
 Closed itemsets provide a minimal representation of itemsets without
losing their support information.
Maximal frequent itemset:
Maximal frequent itemset is a frequent itemset for which none of its immediate
supersets are frequent.

Closed frequent itemset:


A pattern (itemset) X is closed if X is frequent and there exists no super-pattern
Y ⊃ X with the same support as X. A closed frequent itemset is thus a frequent itemset
for which none of its immediate supersets has the same support count as itself.



Given Min_sup = 2
1. {A} is closed because none of its supersets have the same support as itself, but
{A} is not maximal because {A, D} is a superset of {A} and it is frequent.
2. {B} is not frequent.
3. {C} is closed because none of its supersets have the same support as itself.
4. {D} is not closed because its immediate superset {A, D} also has the same support
count, 2.
5. {E} is closed because none of its supersets have the same support as itself.
6. {A}, {C}, {E} and {A, D} are closed frequent itemsets.
7. {C}, {E} and {A, D} are maximal frequent itemsets.

All maximal frequent itemsets are closed frequent itemsets but the converse is not
true.
Fig: Relationship among frequent itemsets — maximal frequent itemsets ⊆ closed
frequent itemsets ⊆ frequent itemsets (nested sets).
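A small Python sketch that classifies frequent itemsets as closed and/or maximal from a table of support counts follows. The support values reuse the counts quoted in the examples above ({A}:3, {C}:2, {D}:2, {E}:2, {A,D}:2 with min_sup = 2); the helper names are illustrative.

min_sup = 2
# Support counts of the frequent itemsets from the example above.
support = {
    frozenset("A"): 3,
    frozenset("C"): 2,
    frozenset("D"): 2,
    frozenset("E"): 2,
    frozenset("AD"): 2,
}
frequent = [s for s in support if support[s] >= min_sup]

def is_closed(x):
    # Closed: no proper superset among the frequent itemsets has the same support.
    return not any(x < y and support[y] == support[x] for y in frequent)

def is_maximal(x):
    # Maximal: no proper superset is frequent at all.
    return not any(x < y for y in frequent)

for x in sorted(frequent, key=lambda s: (len(s), sorted(s))):
    print(sorted(x), "closed" if is_closed(x) else "", "maximal" if is_maximal(x) else "")
# {A}, {C}, {E}, {A, D} are closed; {C}, {E}, {A, D} are also maximal; {D} is neither.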



Mining Various Kinds of Association Rules:
Association rule mining is used for discovering interesting relationships between variables in
large databases. It is designed to detect strong rules in the database based on some
interesting metrics. For any given multi-item transaction, association rules aim to
obtain rules that determine how or why certain items are linked.
Types of Association Rules:
There are various types of association rules in data mining:-
 Multi-relational association rules

 Generalized association rules

 Quantitative association rules

 Interval information association rules

Mining multilevel association rules:


 Using concept hierarchy, transaction data can be represented at various levels
of abstraction.
 In Multilevel Association rule, Association rules are generated at multiple levels
of abstraction.
 Instead of mining only at the lowest level of abstraction, association rules can be
generated at higher levels of abstraction, which represent common-sense knowledge
and can be used efficiently.
Concept Hierarchy:

 It is a sequence of mappings from a set of low-level concepts to higher-level,
more general concepts. The figure below has five levels, from level 0 to level 4.



Approaches to Multilevel association rule mining:
1. Uniform Support(Using uniform minimum support for all level)
2. Reduced Support (Using reduced minimum support at lower levels)
3. Group-based Support(Using item or group based support)

Mining Multidimensional Association Rules:


Association rules with two or more dimensions or predicates can be referred to as
multidimensional association rules.

age(X, "20...29") ∧ occupation(X, "Student") ⇒ buys(X, "Laptop")

This rule contains three predicates (age, occupation, and buys), each of which
occurs only once in the rule; such rules are called inter-dimensional association
rules.
The rules with repeated predicates or containing multiple occurrences of some
predicates are called hybrid-dimension association rules.
For example: age(X, "20...29") ∧ buys(X, "Laptop") ⇒ buys(X, "Printer")
This rule contains two distinct predicates (age and buys), with the predicate buys repeated.
The cells of an n-dimensional cuboid correspond to the predicate sets.

Mining from data cubes can be much faster.


The database attributes should be categorical or quantitative.
 Categorical attributes have a finite number of possible values, with no
ordering among the values also called nominal attributes.
 Quantitative attributes are numeric and have an implicit sequencing between
values.
Correlation:
Correlation is a measure of association between two variables. Correlations can be
positive or negative, ranging between +1 and -1.
Correlation describes the degree and type of relationship between two or more quantities
(variables) that vary together over a period; for example, variation in the level of
expenditure or savings with variation in the level of income.
 A positive correlation exists where the high values of one variable are
associated with the high values of the other variable(s).
 A 'negative correlation' means association of high values of one with the low
values of the other(s).
 Values close to +1 indicate a high degree of positive correlation, and values
close to -1 indicate a high degree of negative correlation.
Interestingness Measure: Correlations (Lift):

lift(A, B) = P(A∪B) / (P(A) P(B))

Lift is a measure of how much more likely the consequent of a rule is to occur when
the antecedent is present, compared to when it is absent. It is the ratio of the
confidence of the rule to the frequency of the consequent in the whole dataset.

Constraint-Based Association Mining:
Data mining can extract thousands of rules that appear to be important from a dataset;
nevertheless, it is likely that the majority of these rules will not provide users with any
value. Users often have a clear concept of the "form" of the patterns or rules
they want to discover and the "direction" of mining that may lead to intriguing pattern
discoveries. It's also possible that they have a preconceived understanding of what the
"conditions" of the rules are, which would prohibit them from seeking rules that they
already know are irrelevant to the situation. As a result of this, a useful heuristic is to
have users pick constraints based on their own intuition or preconceptions about
what should be allowed. This approach is referred to as "constraint-based mining,"
which is an industry term. The following are some illustrative examples of potential
restrictions:
 Knowledge Type of Constraint: These characteristics, which may include
association, correlation, categorization, or grouping, describe the nature of the
knowledge that is to be mined.
 Data Constraint: These characteristics are used to determine the information
that is required to finish a job. These constraint attempts can be guided in the
right direction by imposing constraints, such as limitations on the dimensions
or layers of the data, abstractions, or thought hierarchies.
 Interestingness Constraints: Limitations on interestingness are utilized in the
process of establishing minimum and maximum values for statistical measures
of rule interestingness, including support, confidence, and correlation.
Limitations are placed on the interestingness of rules.
 Rule Constraint: The form or needs of the rules are outlined by the constraints
that are placed on the rules to be mined.



Graph Pattern Mining:
Graph mining is a process in which the mining techniques are used in finding a
pattern or relationship in the given real-world collection of graphs. By mining the
graph, frequent substructures and relationships can be identified which helps in
clustering the graph sets, finding a relationship between graph sets, or
discriminating or characterizing graphs. Predicting these patterning trends can help
in building models for the enhancement of any application that is used in real-time.
To implement the process of graph mining, one must learn to mine frequent
subgraphs.
Frequent Subgraph Mining
Let us consider a graph h with an edge set E(h) and a vertex set V(h). A subgraph
isomorphism from h to h’ exists when h is (isomorphic to) a subgraph of h’. A label
function maps each edge or vertex to a label. Given a labeled graph dataset F, let s(h)
denote the support of h, i.e., the percentage of graphs in F of which h is a subgraph.
A frequent graph is one whose support is no less than the minimum support threshold,
denoted min_support.
Steps in finding frequent subgraphs:
There are two steps in finding frequent subgraphs.
 The first step is to create frequent substructure candidates.

 The second step is to find the support of each and every candidate. The first step
must be optimized and enhanced because the second step requires subgraph
isomorphism testing, which is NP-complete and therefore computationally expensive.
Sequential Pattern Mining:
A sequence database consists of ordered elements or events. For example, a customer
first buys bread, then eggs and cheese, and then milk. This forms a sequence
consisting of three ordered events. We consider an event or a subsequence to be
frequent if its support, which is the number of sequences that contain this event or
subsequence, is greater than a certain value. This algorithm finds patterns in input
sequences satisfying user defined minimum support.
The task of sequential pattern mining is a data mining task specialized for
analyzing sequential data, to discover sequential patterns. More precisely, it
consists of discovering interesting subsequences in a set of sequences, where the
interestingness of a subsequence can be measured in terms of various criteria such as
its occurrence frequency, length, and profit. Sequential pattern mining has
numerous real-life applications due to the fact that data is naturally encoded
as sequences of symbols in many fields such as bioinformatics, e-learning, market
basket analysis, texts, and webpage click-stream analysis. Consider the task
of sequential pattern mining with an example. Consider the following sequence
database, representing the purchases made by customers in a retail store.

This database contains four sequences. Each sequence represents the items
purchased by a customer at different times. A sequence is an ordered list of itemsets
(sets of items bought together). For example, in this database, the first sequence (SID
1) indicates that a customer bought some items a and b together, then purchased an
item c, then purchased items f and g together, then purchased an item g, and then
finally purchased an item e.
Traditionally, sequential pattern mining is being used to find subsequences that
appear often in a sequence database, i.e. that are common to several sequences.
Those subsequences are called the frequent sequential patterns.
For example, in the context of our example, sequential pattern mining can be used
to find the sequences of items frequently bought by customers. This can be useful to
understand the behavior of customers to take marketing decisions.
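A minimal sketch of the basic support computation for a candidate sequential pattern follows. Only the first sequence (SID 1) is taken from the description above; the other sequences, the candidate pattern, and the helper name are hypothetical, since the full table is not reproduced in these notes.

def contains(sequence, pattern):
    # True if `pattern` is a subsequence of `sequence`: both are ordered lists of
    # itemsets, and each pattern itemset must be contained in some later itemset
    # of the sequence, preserving order (greedy earliest match).
    i = 0
    for itemset in sequence:
        if i < len(pattern) and pattern[i] <= itemset:
            i += 1
    return i == len(pattern)

# Sequence database: SID 1 is the sequence described above; the rest are hypothetical.
D = [
    [{"a", "b"}, {"c"}, {"f", "g"}, {"g"}, {"e"}],   # SID 1
    [{"a"}, {"c"}, {"b", "e"}],                      # hypothetical
    [{"a", "b"}, {"d"}, {"e"}],                      # hypothetical
]

pattern = [{"a"}, {"e"}]     # "buys a, then later buys e"
support = sum(contains(s, [frozenset(p) for p in pattern]) for s in D)
print(f"support = {support} of {len(D)} sequences")   # 3 of 3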



Short Questions

1. Define frequent patterns?


2. Define closed itemset?
3. State maximal frequent itemset?
4. State closed frequent itemset?
5. List the techniques of efficiency of Apriori algorithm?
6. Define itemset?
7. Name the steps in association rule mining?
8. Explain the join step?
9. Describe the prune step?
10. State Apriori Principle.
11. Define Candidate Itemset.
12. Define Support and Confidence.
13. Define the terms frequent itemsets, closed itemsets and association rules?
14. List the applications of association rule mining.
15. What is constraint based mining?
16. Describe the five categories of pattern mining constraints?
17. Define the passes in FP-Growth algorithm.
18. Give the limitations of Apriori algorithm
19. Give the advantages and disadvantages of the Apriori algorithm.

Essay Questions

1. List the advantages of FP-Growth algorithm.


2. What are the applications of Association Rule Mining?
3. Compare FP-Growth algorithm and Apriori algorithm.
4. Discuss the generating association rules from frequent itemsets.
5. Discuss about constraint-based association mining?
6. Explain what are additional rule constraints to guide mining?
7. Describe about the Mining closed Frequent Itemset.
8. Give the limitations of APRIORI algorithm.
9. Explain APRIORI algorithm with an example.
10. Compare and Contrast the differences between mining multilevel association rules from
transaction databases and relational databases.
11. Write a short example to show that items in a strong association rule may actually be negatively
correlated.
12. Explain Association rule mining often generates a large number of rules. Discuss effective
methods that can be used to reduce the number of rules generated while still preserving most of
the interesting rules.
13. Discuss which algorithm is an influential algorithm for mining frequent itemsets for boolean
association rules? Explain with an example?
14. Describe the different techniques to improve the efficiency of Apriori? Explain?
15. Discuss the FP-growth algorithm? Explain with an example?
16. Explain how to mine the frequent itemsets using vertical data format?
17. Explain, how can we tell which strong association rules are really interesting? Explain with an
example?
18. Discuss about mining multilevel association rules from transaction databases in detail?
19. Explain how to mine the multidimensional association rules from relational databases and data
warehouses?
20. Discuss about mining multilevel association rules from transaction databases in detail?
21. Apply the following rules on a database has five transactions. Let min_sup = 60% and
min_conf=80%
TID items bought
T100 {M, O, N, K, E, Y}
T200 {D, O, N, K, E, Y }
T300 {M, A, K, E}
T400 {M, U, C, K, Y}
T500 {C, O, O, K, I ,E}
(a) Find all frequent item sets using Apriori and FP-growth, respectively. Compare the efficiency of
the two mining processes.

(b) List all of the strong association rules (with support s and confidence c) matching the following
meta rule, where X is a variable representing customers, and itemi denotes variables
representing items (e.g., “A”, “B”, etc.): ∀X ∈ transaction, buys(X, item1) ∧ buys(X,
item2) ⇒ buys(X, item3) [s, c].

22.
