02 - Data Mining

The document discusses several common data mining tasks including classification, clustering, association rule discovery, sequential pattern discovery, regression, and deviation detection. It provides definitions and examples of how each task can be applied to real-world problems.

Data Mining: Data Mining Tasks & Data

Data Mining - Lecture 2


Data Mining Tasks

• Prediction Tasks
  - Use some variables to predict unknown or future values of other variables.
• Description Tasks
  - Find human-interpretable patterns that describe the data.

Common data mining tasks

• Classification [Predictive]
• Clustering [Descriptive]
• Association Rule Discovery [Descriptive]
• Sequential Pattern Discovery [Descriptive]
• Regression [Predictive]
• Deviation Detection [Predictive]



Classification: Definition

• Given a collection of records (the training set):
  - Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
  - A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.



Classification Example

Training Set

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set

Refund  Marital Status  Taxable Income  Cheat  Cheat
No      Single          75K             ?      No
Yes     Married         50K             ?      No
No      Married         150K            ?      No
Yes     Divorced        90K             ?      Yes
No      Single          40K             ?      No
No      Married         80K             ?      No

[Figure: the Training Set is used to learn a classifier (the Model), which is then applied to the Test Set.]
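The train/test procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "classifier" is a trivial majority-class baseline (not a method named in the lecture), the 7/3 split is arbitrary, and the function names `learn_majority_class` and `accuracy` are hypothetical.

```python
from collections import Counter

# (Refund, Marital Status, Taxable Income in thousands, Cheat) from the ten labelled records.
RECORDS = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]


def learn_majority_class(training_set):
    """Return a model that always predicts the most common training class."""
    majority, _ = Counter(record[-1] for record in training_set).most_common(1)[0]
    return lambda attributes: majority


def accuracy(model, test_set):
    """Fraction of test records whose predicted class equals the true class."""
    correct = sum(1 for record in test_set if model(record[:-1]) == record[-1])
    return correct / len(test_set)


# Assumed split: first seven records build the model, last three validate it.
training_set, test_set = RECORDS[:7], RECORDS[7:]
model = learn_majority_class(training_set)
print(f"accuracy on held-out records: {accuracy(model, test_set):.2f}")
```

A baseline like this mainly shows why a held-out test set matters: the model looks good on the training records but misclassifies most of the unseen ones.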



Classification: Application 1

• Direct Marketing
  - Goal: Reduce the cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
  - Approach:
    - Use the data for a similar product introduced before.
    - We know which customers decided to buy and which decided otherwise. This {buy, don't buy} decision forms the class attribute.
    - Collect various demographic, lifestyle, and company-interaction related information about all such customers.
      - Type of business, where they stay, how much they earn, etc.
    - Use this information as input attributes to learn a classifier model.



Classification: Application 2

• Fraud Detection
  - Goal: Predict fraudulent cases in credit card transactions.
  - Approach:
    - Use credit card transactions and information about the account holder as attributes.
      - When a customer buys, what they buy, how often they pay on time, etc.
    - Label past transactions as fraudulent or fair. This forms the class attribute.
    - Learn a model for the class of the transactions.
    - Use this model to detect fraud by observing credit card transactions on an account.



Clustering Definition

• Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:
  - Data points in one cluster are more similar to one another.
  - Data points in separate clusters are less similar to one another.
• Similarity Measures:
  - Euclidean distance if attributes are continuous.
  - Other problem-specific measures.
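As a small illustration of Euclidean distance used as the similarity measure, the sketch below assigns each point to the closest of two given cluster centres. It shows only the assignment step, not a complete clustering algorithm; the points, the centres, and the helper name `nearest_cluster` are hypothetical.

```python
import math


def nearest_cluster(point, centroids):
    """Index of the centroid closest to `point` under Euclidean distance."""
    distances = [math.dist(point, centroid) for centroid in centroids]
    return distances.index(min(distances))


# Hypothetical data points with two continuous attributes each.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 7.5)]
# Hypothetical cluster centres, assumed to be given.
centroids = [(1.0, 1.5), (8.5, 8.0)]

assignments = [nearest_cluster(point, centroids) for point in points]
print(assignments)
```

Points assigned to the same index are closer to each other than to points with a different index, which is the property the definition asks for.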



Clustering: Application 1

• Market Segmentation:
  - Goal: Subdivide a market into distinct subsets of customers, where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
  - Approach:
    - Collect different attributes of customers based on their geographical and lifestyle related information.
    - Find clusters of similar customers.
    - Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters.



Clustering: Application 2

• Document Clustering:
  - Goal: Find groups of documents that are similar to each other based on the important terms appearing in them.
  - Approach: Identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.
  - Gain: Information retrieval can use the clusters to relate a new document or search term to clustered documents.
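One common way to turn term frequencies into a similarity measure is cosine similarity. The lecture does not name a specific measure, so treat the choice as an assumption; the function name and the example vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical term counts over the same three terms.
doc_a = [3, 0, 1]
doc_b = [6, 0, 2]   # same term proportions as doc_a
doc_c = [0, 4, 0]   # shares no terms with doc_a

print(cosine_similarity(doc_a, doc_b), cosine_similarity(doc_a, doc_c))
```

Because the measure depends on term proportions rather than raw counts, a long and a short document about the same topic can still score as similar.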



Association Rule Discovery: Definition
• Given a set of records, each of which contains some number of items from a given collection:
  - Produce dependency rules that predict the occurrence of an item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Juice, Bread
3    Juice, Coke, Diaper, Milk
4    Juice, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Juice}
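How strongly the data backs a rule is usually quantified with support and confidence. These two measures are standard but are not defined in this lecture, so they are introduced here as an assumption; the function names are hypothetical. A minimal sketch over the five transactions:

```python
TRANSACTIONS = [
    {"Bread", "Coke", "Milk"},
    {"Juice", "Bread"},
    {"Juice", "Coke", "Diaper", "Milk"},
    {"Juice", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def count_containing(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for transaction in transactions if itemset <= transaction)


def support(antecedent, consequent, transactions):
    """Fraction of all transactions containing both sides of the rule."""
    return count_containing(antecedent | consequent, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Among transactions containing the antecedent, the fraction that also contain the consequent."""
    both = count_containing(antecedent | consequent, transactions)
    return both / count_containing(antecedent, transactions)


for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Juice"})]:
    print(sorted(lhs), "-->", sorted(rhs),
          f"support={support(lhs, rhs, TRANSACTIONS):.2f}",
          f"confidence={confidence(lhs, rhs, TRANSACTIONS):.2f}")
```

On this data, {Milk} --> {Coke} holds in three of the four transactions that contain Milk, and {Diaper, Milk} --> {Juice} in two of the three that contain both Diaper and Milk.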



Association Rule Discovery: Application 1

• Marketing and Sales Promotion:
  - Let the rule discovered be
    {Cookies, … } --> {Potato Chips}
  - Potato Chips as consequent => Can be used to determine what should be done to boost its sales.
  - Cookies in the antecedent => Can be used to see which products would be affected if the store discontinues selling Cookies.
  - Cookies in the antecedent and Potato Chips in the consequent => Can be used to see what products should be sold with Cookies to promote the sale of Potato Chips!



Association Rule Discovery: Application 2

• Supermarket shelf management.
  - Goal: Identify items that are bought together by sufficiently many customers.
  - Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items.
  - A classic rule:
    - If a customer buys diapers and milk, then the customer is very likely to buy juice.
    - So, don't be surprised if you find six-packs stacked next to diapers!



Sequential Pattern Discovery: Definition

Given a set of objects, with each object associated with its own timeline of events, find rules that predict strong sequential dependencies among different events.

• In point-of-sale transaction sequences:
  - Computer Bookstore:
    (Intro_To_Visual_C) (C++_Primer) --> (Perl_for_dummies)
  - Athletic Apparel Store:
    (Shoes) (Racket, Racketball) --> (Sports_Jacket)
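To make the notion of a sequential pattern concrete, the sketch below checks whether one customer's timeline contains the events of a rule in the stated order. It only tests containment for a single timeline and does not mine rules; the timelines and the helper name `contains_sequence` are hypothetical.

```python
def contains_sequence(timeline, pattern):
    """True if the itemsets in `pattern` occur, in order, within `timeline`.

    Both arguments are lists of sets; each pattern element must be a subset of
    a distinct timeline event, and the matching events must respect the order.
    """
    position = 0
    for event in timeline:
        if position < len(pattern) and pattern[position] <= event:
            position += 1
    return position == len(pattern)


# Bookstore rule: (Intro_To_Visual_C) (C++_Primer) --> (Perl_for_dummies)
antecedent = [{"Intro_To_Visual_C"}, {"C++_Primer"}]
consequent = [{"Perl_for_dummies"}]

# Hypothetical purchase timelines, one set of items per visit.
follows_rule = [{"Intro_To_Visual_C"}, {"C++_Primer", "Notebook"}, {"Perl_for_dummies"}]
wrong_order = [{"C++_Primer"}, {"Intro_To_Visual_C"}, {"Perl_for_dummies"}]

print(contains_sequence(follows_rule, antecedent + consequent))
print(contains_sequence(wrong_order, antecedent + consequent))
```

The second timeline contains the same items but in a different order, so it does not support the rule.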



Regression
• Predict the value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
• Extensively studied in statistics and in the neural network field.
• Examples:
  - Predicting the age of a person based on MaritalStatus, NumberOfChildren, Income, …
    E.g., if MaritalStatus = Yes, Age = 20 + 4*NumberOfChildren + 0.0001*Income + …
  - Predicting wind velocities as a function of temperature, humidity, and pressure.
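The age example is a linear model, which can be written directly as a function. The sketch below uses only the terms that are written out in the formula; the terms hidden behind the trailing "…" are unknown and are left out, and the function name `predict_age` is hypothetical.

```python
def predict_age(number_of_children, income):
    """Linear model for the MaritalStatus = Yes case.

    Only the terms stated in the lecture formula are included; whatever the
    trailing "…" stands for is unknown and therefore omitted.
    """
    return 20 + 4 * number_of_children + 0.0001 * income


# Hypothetical person: two children, income of 50,000.
print(predict_age(2, 50_000))
```

Each coefficient states how much the predicted age changes per unit of the corresponding variable: four years per child and one year per 10,000 of income.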
Deviation/Anomaly Detection

• Detect significant deviations from normal behavior.
• Applications:
  - Credit card fraud detection
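One simple way to operationalise "significant deviation" is to flag values that lie more than a chosen number of standard deviations from the mean. The lecture does not prescribe a method, so this z-score rule, the threshold of 2, the function name, and the transaction amounts are all assumptions made for illustration.

```python
import statistics


def find_deviations(values, threshold=2.0):
    """Values whose distance from the mean exceeds `threshold` standard deviations."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)  # population standard deviation
    if spread == 0:
        return []  # all values identical, nothing deviates
    return [value for value in values if abs(value - mean) / spread > threshold]


# Hypothetical card transaction amounts for one account.
amounts = [20, 25, 22, 19, 24, 21, 23, 20, 22, 500]
print(find_deviations(amounts))
```

A single extreme value inflates both the mean and the standard deviation, so on real data a more robust notion of "normal behavior" is often preferable.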



What is Data?

• Collection of data objects and their attributes.
• An attribute is a property or characteristic of an object.
  - Examples: eye color of a person, temperature, etc.
  - An attribute is also known as a variable, field, characteristic, or feature.
• A collection of attributes describes an object.
  - An object is also known as a record, point, case, sample, entity, or instance.

In the table below, the columns are the attributes and the rows are the objects.

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes



Types of data sets
• Record
  - Data Matrix
  - Document Data
  - Transaction Data
• Multi-Relational
  - Star or snowflake schema
• Graph
  - World Wide Web
  - Molecular Structures
• Ordered
  - Sequential Data
  - Spatial Data
  - Temporal Data



Record Data
• Data that consists of a collection of records, each of which consists of a fixed set of attributes.

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes



Data Matrix
• If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute.
• Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute.

Projection of x load  Projection of y load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1
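As a minimal sketch, the two objects above can be stored as an m by n matrix (here a plain list of rows) and then treated as points in five-dimensional space. Using Euclidean distance between the rows is an illustrative choice, not something this part of the lecture specifies.

```python
import math

# Rows are objects, columns are attributes:
# projection of x load, projection of y load, distance, load, thickness.
data_matrix = [
    [10.23, 5.27, 15.22, 2.7, 1.2],
    [12.65, 6.25, 16.22, 2.2, 1.1],
]

m = len(data_matrix)       # number of objects (rows)
n = len(data_matrix[0])    # number of attributes (columns)

# Each row is a point in n-dimensional space, so distances are well defined.
distance = math.dist(data_matrix[0], data_matrix[1])
print(m, n, round(distance, 3))
```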



Document Data
• Each document becomes a 'term' vector:
  - each term is a component (attribute) of the vector,
  - the value of each component is the number of times the corresponding term occurs in the document.

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1  3     0      5     0     2      6     0    2     0        2
Document 2  0     7      0     2     1      0     0    3     0        0
Document 3  0     1      0     0     1      2     2    0     3        0
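A term vector of this kind can be built by counting words against a fixed vocabulary. The sketch below assumes naive whitespace tokenisation and lower-casing; the function name `term_vector` and the sample sentence are hypothetical.

```python
from collections import Counter

# The ten terms used as vector components in the example.
VOCABULARY = ["team", "coach", "play", "ball", "score",
              "game", "win", "lost", "timeout", "season"]


def term_vector(text, vocabulary=VOCABULARY):
    """Count how often each vocabulary term occurs in `text`."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


# Hypothetical document text.
print(term_vector("Team play play game"))
```

Words outside the vocabulary are ignored, so every document maps to a vector of the same length regardless of how long the text is.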



Transaction Data
• A special type of record data, where
  - each record (transaction) involves a set of items.
• For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.

TID  Items
1    Bread, Coke, Milk
2    Juice, Bread
3    Juice, Coke, Diaper, Milk
4    Juice, Bread, Diaper, Milk
5    Coke, Diaper, Milk



Multi-Relational Data

• Attributes are objects themselves



Graph Data
• Examples: generic graph and HTML links

[Figure: a generic graph with numbered nodes]

<a href="papers/papers.html#bbbb">
Data Mining </a>
<li>
<a href="papers/papers.html#aaaa">
Graph Partitioning </a>
<li>
<a href="papers/papers.html#aaaa">
Parallel Solution of Sparse Linear System of Equations </a>
<li>
<a href="papers/papers.html#ffff">
N-Body Computation and Dense Linear System Solvers </a>
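To show how HTML links become graph data, the sketch below extracts the link targets from a fragment like the one above and stores them as outgoing edges of the page. It uses Python's standard `html.parser`; the page name `index.html` and the class name `LinkCollector` are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href target of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


page = """
<a href="papers/papers.html#bbbb">Data Mining </a>
<li>
<a href="papers/papers.html#aaaa">Graph Partitioning </a>
"""

collector = LinkCollector()
collector.feed(page)

# Adjacency list: the page is a node, each link is a directed edge to its target.
graph = {"index.html": collector.links}
print(graph)
```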



Chemical Data
• Benzene molecule: C6H6



Ordered Data
• Sequences of transactions

[Figure: a sequence of transactions; each element of the sequence is a set of items/events.]



Ordered Data
• Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG



Ordered Data

• Spatio-Temporal Data

[Figure: average monthly temperature of land and ocean]
