Unit 3 Final

This document discusses association rule mining, which aims to find frequent patterns and correlations in transactional databases. It defines key concepts like support, confidence and lift used to evaluate association rules. Association rule mining is useful for applications like market basket analysis, where rules can reveal products commonly purchased together. The document also covers single and multi-dimensional association rules, as well as classification based on abstraction level and data type.

Uploaded by SOORAJ CHANDRAN

Unit 3

Association Rule Mining – Mining Single-Dimensional Boolean Association Rules from
Transactional Databases – Multilevel Association Rules – Classification and
Prediction – Classification by Decision Tree Induction – Bayesian Classification –
Predictive Data Mining.

Association Rule Mining

Association rule mining is a procedure that aims to discover frequently occurring
patterns, correlations, or associations in datasets stored in various kinds of
databases, such as relational databases, transactional databases, and other
repositories.
An association rule has two parts:
• an antecedent (if) and
• a consequent (then).
An antecedent is an item found in the data, and a consequent is an item found in
combination with the antecedent. Have a look at this rule, for instance:
“If a customer buys bread, they are 70% likely to also buy milk.”
In the above association rule, bread is the antecedent and milk is the consequent.
Simply put, a retail store can use such rules to target its customers better. If the
above rule is the result of a thorough analysis of the data, it can be used not only to
improve customer service but also to increase the company’s revenue.
Association rules are created by thoroughly analyzing data and looking for frequent
if/then patterns. The important relationships are then identified using the following
two parameters:
1. Support: Support indicates how frequently the if/then relationship appears in the
database.
2. Confidence: Confidence indicates how often these if/then relationships have been
found to be true.
So, in a given transaction with multiple items, Association Rule Mining primarily tries to
find the rules that govern how or why such products/items are often bought together.
For example, peanut butter and jelly are frequently purchased together because a lot of
people like to make PB&J sandwiches.
Association Rule Mining is sometimes referred to as “Market Basket Analysis”, as it
was the first application area of association mining. The aim is to discover
associations of items occurring together more often than you’d expect from randomly
sampling all the possibilities. The classic anecdote of Beer and Diaper will help in
understanding this better.
Association rule mining finds interesting associations and relationships among large
sets of data items. An association rule shows how frequently an itemset occurs in a
transaction. A typical example is Market Basket Analysis.
Market Basket Analysis is one of the key techniques used by large retailers to uncover
associations between items. It allows retailers to identify relationships between the
items that people frequently buy together.
Given a set of transactions, we can find rules that will predict the occurrence of an item
based on the occurrences of other items in the transaction.

Before we start defining the rules, let us first look at the basic definitions (the
counts below refer to a small example set of five transactions).
Support count (σ) – Frequency of occurrence of an itemset. Here σ({Milk, Bread, Diaper}) = 2.
Frequent itemset – An itemset whose support is greater than or equal to a minsup
threshold.
Association rule – An implication expression of the form X -> Y, where X and Y are
any two itemsets.
Example: {Milk, Diaper} -> {Beer}
Rule Evaluation Metrics –
• Support(s) –
The number of transactions that include all items in both {X} and {Y}, as a
percentage of the total number of transactions. It is a measure of how
frequently the collection of items occurs together as a fraction of all
transactions.
• Support(X => Y) = σ(X ∪ Y) ÷ |T| –
It is interpreted as the fraction of transactions that contain both X and Y.
• Confidence(c) –
The ratio of the number of transactions that include all items in both {X} and
{Y} to the number of transactions that include all items in {X}.
• Conf(X => Y) = Supp(X ∪ Y) ÷ Supp(X) –
It measures how often the items in Y appear in transactions that also contain the
items in X.
• Lift(l) –
The lift of the rule X => Y is the confidence of the rule divided by the expected
confidence, assuming that the itemsets X and Y are independent of each other.
Under that independence assumption, the expected confidence is simply the
support of {Y}.
• Lift(X => Y) = Conf(X => Y) ÷ Supp(Y) –
A lift value near 1 indicates that X and Y appear together about as often as
expected; greater than 1 means they appear together more often than expected,
and less than 1 means they appear together less often than expected. Greater
lift values indicate a stronger association.

Example – From the table above, for the rule {Milk, Diaper} => {Beer}:

s = σ({Milk, Diaper, Beer}) / |T|
= 2/5
= 0.4

c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper})
= 2/3
= 0.67

l = Supp({Milk, Diaper, Beer}) / (Supp({Milk, Diaper}) × Supp({Beer}))
= 0.4 / (0.6 × 0.6)
= 1.11
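The three metrics can be computed directly. The following is a minimal sketch; since the transaction table is not reproduced in the text, the five baskets below are an assumption chosen so that the counts match the worked example (σ({Milk, Diaper, Beer}) = 2, σ({Milk, Diaper}) = 3, σ({Beer}) = 3).

```python
# Assumed five-transaction dataset; counts match the worked example above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def supp(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

# Rule {Milk, Diaper} => {Beer}
X, Y = {"Milk", "Diaper"}, {"Beer"}
support = supp(X | Y)               # sigma(X u Y) / |T|      -> 0.4
confidence = supp(X | Y) / supp(X)  # Supp(X u Y) / Supp(X)   -> 0.67
lift = confidence / supp(Y)         # Conf(X => Y) / Supp(Y)  -> 1.11
```

Note that lift is symmetric in X and Y, which is why the worked example can equivalently write it as Supp(X ∪ Y) / (Supp(X) × Supp(Y)).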
Association rules are very useful for analyzing such datasets. In supermarkets the
data is collected using bar-code scanners; such databases consist of a large number
of transaction records, each listing all items bought by a customer in a single
purchase. A manager can thus learn whether certain groups of items are consistently
purchased together and use this information to adjust store layouts, cross-selling,
and promotions.

Mining single-dimensional Boolean association rules from transactional databases and multilevel association rules

What is association mining?

Association mining aims to extract interesting correlations, frequent patterns,
associations, or causal structures among sets of items or objects in transactional
databases, relational databases, or other data repositories. Association rules are
widely used in areas such as telecommunication networks, market and risk management,
inventory control, cross-marketing, catalog design, loss-leader analysis, clustering,
classification, etc.

Examples:
Rule form: “Body ⇒ Head [support, confidence]”.

buys(x, “diapers”) ⇒ buys(x, “beers”) [0.5%, 60%]

major(x, “cs”) ^ takes(x, “DB”) ⇒ grade(x, “A”) [1%, 75%]

Association rule: basic concepts:


• Given: (1) a database of transactions; (2) each transaction is a list of items
(purchased by a customer in a visit)
• Find: all rules that correlate the presence of one set of items with that of
another set of items.
⎯ E.g., 98% of people who purchase tires and auto accessories also get
automotive services done.
⎯ E.g., Market Basket Analysis
This process analyzes customer buying habits by finding
associations between the different items that customers place in
their “Shopping Baskets”. The discovery of such associations can
help retailers develop marketing strategies by gaining insight into
which items are frequently purchased together by customer.

• Applications
⎯ * ⇒ maintenance agreement (what should the store do to boost
maintenance agreement sales?)
⎯ Home electronics ⇒ * (what other products should the store stock up on?)
⎯ Attached mailing in direct marketing
⎯ Detecting “ping-pong”-ing of patients, faulty “collisions”

RULE Measures: support and confidence

Support: percentage of transactions in D that contain A ∪ B.

Confidence: percentage of transactions in D containing A that also contain B.

Support(A ⇒ B) = P(A ∪ B)

Confidence(A ⇒ B) = P(B | A)

Rules that satisfy both a minimum support threshold (min_sup) and a

minimum confidence threshold (min_conf) are called strong.

Let minimum support = 50% and minimum confidence = 50%; then we have

A ⇒ C (50%, 66.6%)

C ⇒ A (50%, 100%)

In general, association rule mining can be viewed as a two-step process:


1. Find all frequent itemsets: By definition, each of these itemsets will occur at least as
frequently as a predetermined minimum support count, min_sup.
2. Generate strong association rules from the frequent itemsets:
By definition, these rules must satisfy minimum support and minimum confidence.
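The two steps above can be sketched in code. This is a minimal, unoptimized Apriori-style sketch, not the textbook algorithm verbatim, and the five-basket dataset is a hypothetical example:

```python
from itertools import combinations

def mine_rules(transactions, min_sup, min_conf):
    """Step 1: find all frequent itemsets. Step 2: generate strong rules."""
    n = len(transactions)
    baskets = [frozenset(t) for t in transactions]

    def support(items):
        return sum(1 for b in baskets if items <= b) / n

    # Step 1: grow frequent itemsets level by level (1-itemsets, 2-itemsets, ...).
    level = {frozenset([i]) for b in baskets for i in b}
    level = {s for s in level if support(s) >= min_sup}
    frequent = {}
    while level:
        frequent.update({s: support(s) for s in level})
        # Join frequent k-itemsets into candidate (k+1)-itemsets, keep frequent ones.
        level = {a | b for a in level for b in level
                 if len(a | b) == len(a) + 1 and support(a | b) >= min_sup}

    # Step 2: split each frequent itemset into antecedent => consequent
    # and keep rules meeting the confidence threshold.
    rules = []
    for s, sup in frequent.items():
        for k in range(1, len(s)):
            for ante in map(frozenset, combinations(sorted(s), k)):
                conf = sup / frequent[ante]  # subsets of frequent sets are frequent
                if conf >= min_conf:
                    rules.append((set(ante), set(s - ante), sup, conf))
    return frequent, rules

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
frequent, rules = mine_rules(transactions, min_sup=0.4, min_conf=0.6)
```

The level-wise join relies on the downward-closure property from step 1: every subset of a frequent itemset is itself frequent, so its support is already in the `frequent` table when rules are generated.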

Classification of association rules mining:

• Based on the level of abstraction involved in the rules set:


⎯ Single-level association rules reference items or attributes at only one level of abstraction.
buys(X, “computer”) ⇒ buys(X, “HP printer”)
⎯ Multilevel association rules reference items or attributes at different levels of
abstraction.
buys(X, “laptop computer”) ⇒ buys(X, “HP printer”)

• Based on the number of data dimensions involved in the rules:


⎯ A single-dimensional association rule is an association rule in which the items
or attributes reference only one dimension.
buys(X, “computer”) ⇒ buys(X, “antivirus software”)

⎯ A multidimensional association rule references two or more dimensions.
age(X, “30...39”) ^ income(X, “42K...48K”) ⇒ buys(X, “high resolution TV”)
• Based on the types of the values handled in rule:
⎯ Boolean association rules involve associations between the presence and
absence of items.
buys(X, “SQLServer”) ^ buys(X, “DMBook”) ⇒ buys(X, “DBMiner”)

⎯ Quantitative association rules describe associations between quantitative items
or attributes.
age(X, “30...39”) ^ income(X, “42K...49K”) ⇒ buys(X, “PC”)

Data Mining - Classification & Prediction

There are two forms of data analysis that can be used for extracting models describing
important classes or to predict future data trends. These two forms are as follows −

• Classification
• Prediction
Classification models predict categorical class labels, while prediction models predict
continuous-valued functions. For example, we can build a classification model to
categorize bank loan applications as either safe or risky, or a prediction model to
predict the expenditure in dollars of potential customers on computer equipment,
given their income and occupation.
What is classification?

Following are examples of cases where the data analysis task is classification −
• A bank loan officer wants to analyze the data in order to know which customers
(loan applicants) are risky and which are safe.
• A marketing manager at a company needs to analyze customer profiles to predict
who will buy a new computer.
In both of the above examples, a model or classifier is constructed to predict the
categorical labels. These labels are risky or safe for loan application data and yes or no
for marketing data.

What is prediction?

Following are the examples of cases where the data analysis task is Prediction −
Suppose the marketing manager needs to predict how much a given customer will
spend during a sale at his company. In this example we are asked to predict a
numeric value, so the data analysis task is an example of numeric prediction. In
this case, a model or predictor is constructed that predicts a continuous-valued
function, or ordered value.
Note − Regression analysis is a statistical methodology that is most often used for
numeric prediction.
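Regression can be sketched very compactly for the one-variable case. This is a minimal ordinary-least-squares fit; the income/spend numbers are made up for illustration and are not from the text:

```python
def fit_line(xs, ys):
    """Ordinary least squares for the line y = a + b*x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b

# Hypothetical data: income (in thousands) vs. spend (in dollars).
incomes = [30, 40, 50, 60]
spend = [300, 400, 500, 600]
a, b = fit_line(incomes, spend)
predicted = a + b * 45  # numeric prediction for a new customer
```

The fitted model returns a continuous value for any input, which is exactly what distinguishes numeric prediction from classification.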

How Does Classification Work?

With the help of the bank loan application that we have discussed above, let us
understand the working of classification. The Data Classification process includes two
steps −

• Building the Classifier or Model


• Using Classifier for Classification
Building the Classifier or Model
• This step is the learning step or the learning phase.
• In this step the classification algorithms build the classifier.
• The classifier is built from the training set made up of database tuples and their
associated class labels.

• Each tuple that constitutes the training set is assumed to belong to a predefined
category or class. These tuples can also be referred to as samples, objects, or
data points.
Using Classifier for Classification
In this step, the classifier is used for classification. Here the test data is used to
estimate the accuracy of classification rules. The classification rules can be applied to
the new data tuples if the accuracy is considered acceptable.

Classification and Prediction Issues

The major issue is preparing the data for Classification and Prediction. Preparing the
data involves the following activities −
• Data Cleaning − Data cleaning involves removing noise and treating missing
values. Noise is removed by applying smoothing techniques, and the problem of
missing values is solved by replacing a missing value with the most commonly
occurring value for that attribute.
• Relevance Analysis − The database may also contain irrelevant attributes.
Correlation analysis is used to determine whether any two given attributes are
related.
• Data Transformation and Reduction − The data can be transformed by any of the
following methods:
  o Normalization − Normalization involves scaling all values of a given
attribute so that they fall within a small specified range. It is used when
neural networks or methods involving distance measurements are used in the
learning step.
  o Generalization − The data can also be transformed by generalizing it to a
higher-level concept. For this purpose we can use concept hierarchies.
Note − Data can also be reduced by other methods such as wavelet
transformation, binning, histogram analysis, and clustering.
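The normalization step above can be sketched as a simple min-max rescaling; the income values below are hypothetical:

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Scale values linearly so they fall within [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo)
            for v in values]

incomes = [42_000, 45_000, 48_000]   # hypothetical attribute values
scaled = min_max_normalize(incomes)  # values mapped into [0.0, 1.0]
```

After scaling, attributes measured on very different ranges contribute comparably to distance-based methods, which is why normalization matters for neural networks and nearest-neighbor-style learners.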

Comparison of Classification and Prediction Methods

Here are the criteria for comparing classification and prediction methods −
• Accuracy − The accuracy of a classifier refers to its ability to predict the
class label correctly; the accuracy of a predictor refers to how well it can
guess the value of the predicted attribute for new data.
• Speed − This refers to the computational cost of generating and using the
classifier or predictor.
• Robustness − This refers to the ability of the classifier or predictor to make
correct predictions from noisy data.
• Scalability − This refers to the ability to construct the classifier or predictor
efficiently given a large amount of data.
• Interpretability − This refers to the extent to which the classifier or predictor
can be understood.

Data Mining - Decision Tree Induction

A decision tree is a structure that includes a root node, branches, and leaf nodes. Each
internal node denotes a test on an attribute, each branch denotes the outcome of a test,
and each leaf node holds a class label. The topmost node in the tree is the root node.
A decision tree for the concept buy_computer, for example, indicates whether a
customer at a company is likely to buy a computer or not. Each internal node
represents a test on an attribute, and each leaf node represents a class.
The benefits of having a decision tree are as follows −

• It does not require any domain knowledge.


• It is easy to comprehend.
• The learning and classification steps of a decision tree are simple and fast.
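Classifying with such a tree is just a walk from the root to a leaf. The sketch below hard-codes a buy_computer-style tree; since the diagram is not reproduced here, the attributes and split points are assumptions modeled on the common textbook version of this example:

```python
def buys_computer(age, student, credit_rating):
    """Walk an assumed buy_computer tree from root test to a leaf label."""
    if age <= 30:                    # root test on age: youth branch
        return student == "yes"      # decided by the student attribute
    elif age <= 40:                  # middle-aged branch: a pure leaf
        return True
    else:                            # senior branch
        return credit_rating == "fair"

decision = buys_computer(25, "yes", "fair")  # follows age -> student branch
```

Each `if` corresponds to an internal node's attribute test, each `return` to a leaf's class label, which is why tree classification is simple and fast.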

Tree Pruning

Tree pruning is performed in order to remove anomalies in the training data due to
noise or outliers. The pruned trees are smaller and less complex.
Tree Pruning Approaches

There are two approaches to prune a tree −


• Pre-pruning − The tree is pruned by halting its construction early.
• Post-pruning − This approach removes a sub-tree from a fully grown tree.

Cost Complexity

The cost complexity is measured by the following two parameters −

• Number of leaves in the tree, and


• Error rate of the tree

Data Mining - Bayesian Classification


Bayesian classification is based on Bayes' Theorem. Bayesian classifiers are
statistical classifiers that can predict class membership probabilities, such as the
probability that a given tuple belongs to a particular class.

Bayes' Theorem

Bayes' Theorem is named after Thomas Bayes. There are two types of probabilities −

• Posterior probability, P(H|X)

• Prior probability, P(H)

where X is a data tuple and H is some hypothesis. According to Bayes' Theorem,

P(H|X) = P(X|H) P(H) / P(X)
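The theorem is a one-line computation; the probability values below are made-up numbers for illustration only:

```python
def posterior(p_h, p_x_given_h, p_x):
    """Bayes' Theorem: P(H|X) = P(X|H) * P(H) / P(X)."""
    return p_x_given_h * p_h / p_x

# Hypothetical numbers: P(H) = 0.3, P(X|H) = 0.5, P(X) = 0.4.
p = posterior(0.3, 0.5, 0.4)  # P(H|X) = 0.5 * 0.3 / 0.4
```

A Bayesian classifier evaluates this posterior for each candidate class H given the tuple X, and assigns X to the class with the highest posterior probability.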

Bayesian Belief Network

Bayesian Belief Networks specify joint conditional probability distributions. They are
also known as Belief Networks, Bayesian Networks, or Probabilistic Networks.
• A Belief Network allows class conditional independencies to be defined between
subsets of variables.
• It provides a graphical model of causal relationships on which learning can be
performed.
• We can use a trained Bayesian Network for classification.
There are two components that define a Bayesian Belief Network −

• Directed acyclic graph


• A set of conditional probability tables

Directed Acyclic Graph

• Each node in the directed acyclic graph represents a random variable.
• These variables may be discrete or continuous valued.
• These variables may correspond to actual attributes given in the data.

Directed Acyclic Graph Representation

Consider a directed acyclic graph over six Boolean variables relating to lung cancer.
The arcs in such a diagram represent causal knowledge. For example, lung
cancer is influenced by a person's family history of lung cancer, as well as by whether
or not the person is a smoker. It is worth noting that the variable PositiveXray is
independent of whether the patient has a family history of lung cancer or is a
smoker, given that we know the patient has lung cancer.

Conditional Probability Table

The conditional probability table for the variable LungCancer (LC) gives the
probability of each of its values for every possible combination of the values of its
parent nodes, FamilyHistory (FH) and Smoker (S).
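A CPT and the network's factorization can be sketched as plain lookup tables. Since the document's actual table is not reproduced, every probability below is an invented placeholder:

```python
# All probabilities here are invented placeholders, not the document's CPT.
p_fh = {"yes": 0.1, "no": 0.9}   # P(FamilyHistory)
p_s = {"yes": 0.3, "no": 0.7}    # P(Smoker)
p_lc_yes = {                     # CPT: P(LungCancer = "yes" | FH, S)
    ("yes", "yes"): 0.8,
    ("yes", "no"): 0.5,
    ("no", "yes"): 0.7,
    ("no", "no"): 0.1,
}

def joint(fh, s, lc):
    """P(FH, S, LC) via the network factorization P(FH) P(S) P(LC | FH, S)."""
    p_lc = p_lc_yes[(fh, s)] if lc == "yes" else 1.0 - p_lc_yes[(fh, s)]
    return p_fh[fh] * p_s[s] * p_lc

p = joint("yes", "yes", "yes")  # 0.1 * 0.3 * 0.8
```

This is the sense in which the network "specifies joint conditional probability distributions": each node stores only a CPT over its parents, and the full joint probability is recovered by multiplying down the graph.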

What is Predictive Data Mining?


Predictive data mining is data mining that is done for the purpose of using business
intelligence or other data to forecast or predict trends. This type of data mining can
help business leaders make better decisions and can add value to the efforts of the
analytics team.
