02 - Data Mining

The document discusses several common data mining tasks including classification, clustering, association rule discovery, sequential pattern discovery, regression, and deviation detection. It provides definitions and examples of how each task can be applied to real-world problems.

Uploaded by

Dd d

Data Mining: Data Mining Tasks & Data

Data Mining - Lecture 2

Data Mining Tasks

 Prediction Tasks
 Use some variables to predict unknown or future values of other
variables
 Description Tasks
 Find human-interpretable patterns that describe the data.

Common data mining tasks

 Classification [Predictive]
 Clustering [Descriptive]
 Association Rule Discovery [Descriptive]
 Sequential Pattern Discovery [Descriptive]
 Regression [Predictive]
 Deviation Detection [Predictive]

Classification: Definition

 Given a collection of records (the training set)
 Each record contains a set of attributes; one of the attributes is the
class.
 Find a model for the class attribute as a function of
the values of the other attributes.
 Goal: previously unseen records should be
assigned a class as accurately as possible.
 A test set is used to determine the accuracy of the model. Usually,
the given data set is divided into training and test sets, with the training
set used to build the model and the test set used to validate it.

Classification Example
Training Set:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set:
Refund  Marital Status  Taxable Income  Cheat  Cheat
No      Single          75K             ?      No
Yes     Married         50K             ?      No
No      Married         150K            ?      No
Yes     Divorced        90K             ?      Yes
No      Single          40K             ?      No
No      Married         80K             ?      No

[Figure: Training Set --> Learn Classifier --> Model, which is applied to the Test Set]
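The train/test workflow described above can be sketched in a few lines. This is only an illustration: the slides do not name a learning algorithm, so the sketch assumes a 1-nearest-neighbor classifier with a hand-made distance (categorical mismatches plus an income gap scaled by an arbitrary 50K), and it assumes a hold-out split in which the first eight labelled records train the model and the last two test it.

```python
# Labelled records from the example table: (refund, marital_status, income_k, cheat)
RECORDS = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]


def distance(a, b, income_scale=50.0):
    """Assumed mixed distance: categorical mismatches plus the scaled income gap."""
    return (a[0] != b[0]) + (a[1] != b[1]) + abs(a[2] - b[2]) / income_scale


def predict(train, record):
    """1-nearest-neighbor: copy the class of the closest training record."""
    nearest = min(train, key=lambda t: distance(t, record))
    return nearest[3]


def accuracy(train, test):
    """Fraction of test records whose predicted class matches the true class."""
    hits = sum(predict(train, r) == r[3] for r in test)
    return hits / len(test)


# Hold-out split (an assumption): first eight records train, last two test.
train_set, test_set = RECORDS[:8], RECORDS[8:]
predictions = [predict(train_set, r) for r in test_set]
print(predictions, accuracy(train_set, test_set))
```

With only two test records the accuracy figure is not a meaningful estimate; the point is the separation between the data used to build the model and the data used to validate it.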

Classification: Application 1

 Direct Marketing
 Goal: Reduce cost of mailing by targeting a set of
consumers likely to buy a new cell-phone product.
 Approach:
 Use the data for a similar product introduced before.
 We know which customers decided to buy and which decided
otherwise. This {buy, don’t buy} decision forms the class
attribute.
 Collect various demographic, lifestyle, and company-interaction
related information about all such customers.
 Type of business, where they stay, how much they earn, etc.
 Use this information as input attributes to learn a classifier
model.

Classification: Application 2

 Fraud Detection
 Goal: Predict fraudulent cases in credit card
transactions.
 Approach:
 Use credit card transactions and the information on the
account holder as attributes.
 When does a customer buy, what do they buy, how often do they
pay on time, etc.
 Label past transactions as fraud or fair transactions. This forms
the class attribute.
 Learn a model for the class of the transactions.
 Use this model to detect fraud by observing credit card
transactions on an account.

Clustering Definition

 Given a set of data points, each having a set of
attributes, and a similarity measure among
them, find clusters such that
 Data points in one cluster are more similar to one
another.
 Data points in separate clusters are less similar to one
another.
 Similarity Measures:
 Euclidean Distance if attributes are continuous.
 Other Problem-specific Measures.
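As a rough illustration of clustering with Euclidean distance, the sketch below uses k-means. The slides do not name a clustering algorithm, so k-means is an assumption made here, and the six two-dimensional points and the two starting centroids are invented for the example.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points with continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    recomputing each centroid as the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)), key=lambda i: euclidean(p, centroids[i]))
            clusters[idx].append(p)
        centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else cen
            for cl, cen in zip(clusters, centroids)
        ]
    return clusters, centroids


# Hypothetical data: two visibly separated groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
clusters, centroids = kmeans(points, centroids=[points[0], points[3]])
print(clusters)
```

Points within each resulting cluster are closer to one another than to points in the other cluster, which is the property stated in the definition.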

Clustering: Application 1

 Market Segmentation:
 Goal: subdivide a market into distinct subsets
of customers where any subset may
conceivably be selected as a market target to
be reached with a distinct marketing mix.
 Approach:
 Collect different attributes of customers based on
their geographical and lifestyle related information.
 Find clusters of similar customers.
 Measure the clustering quality by observing buying
patterns of customers in same cluster vs. those from
different clusters.

Clustering: Application 2

 Document Clustering:
 Goal: To find groups of documents that are
similar to each other based on the important
terms appearing in them.
 Approach: Identify frequently occurring
terms in each document, form a similarity
measure based on the frequencies of different
terms, and use it to cluster.
 Gain: Information Retrieval can utilize the
clusters to relate a new document or search
term to clustered documents.
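One possible similarity measure based on term frequencies is cosine similarity. The slides do not say which measure is intended, so cosine similarity is an assumption, and the three short documents below are invented for the example.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Count how often each whitespace-separated term occurs."""
    return Counter(text.lower().split())


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical documents: two about sports, one about finance.
docs = {
    "sports_1": "team coach game score team win",
    "sports_2": "coach team game lost game",
    "finance_1": "stock market price stock dividend",
}
tf = {name: term_frequencies(text) for name, text in docs.items()}
sim_sports = cosine_similarity(tf["sports_1"], tf["sports_2"])
sim_mixed = cosine_similarity(tf["sports_1"], tf["finance_1"])
print(sim_sports, sim_mixed)
```

The two sports documents share terms and score well above zero, while the sports and finance documents share none and score exactly zero, so a clustering step would group the first two together.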

Association Rule Discovery: Definition
 Given a set of records, each of which contains some
number of items from a given collection;
 Produce dependency rules which will predict the occurrence of an
item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Juice, Bread
3    Juice, Coke, Diaper, Milk
4    Juice, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Juice}
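To see how strongly the five transactions back each rule, one can compute its support and confidence. These two measures are the standard way to score association rules, although this slide does not define them; the sketch below is a minimal illustration, not a rule-mining algorithm.

```python
# The five transactions from the table above.
TRANSACTIONS = [
    {"Bread", "Coke", "Milk"},
    {"Juice", "Bread"},
    {"Juice", "Coke", "Diaper", "Milk"},
    {"Juice", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions)


def support(antecedent, consequent, transactions):
    """Fraction of all transactions containing antecedent and consequent."""
    return count(antecedent | consequent, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent."""
    return count(antecedent | consequent, transactions) / count(antecedent, transactions)


print(support({"Milk"}, {"Coke"}, TRANSACTIONS))
print(confidence({"Milk"}, {"Coke"}, TRANSACTIONS))
print(confidence({"Diaper", "Milk"}, {"Juice"}, TRANSACTIONS))
```

{Milk} --> {Coke} holds in 3 of the 4 transactions that contain Milk, and {Diaper, Milk} --> {Juice} holds in 2 of the 3 transactions that contain both Diaper and Milk.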

Association Rule Discovery: Application 1

 Marketing and Sales Promotion:

 Let the rule discovered be
{Cookies, … } --> {Potato Chips}
 Potato Chips as consequent => Can be used to
determine what should be done to boost its sales.
 Cookies in the antecedent => Can be used to see
which products would be affected if the store
discontinues selling Cookies.
 Cookies in antecedent and Potato chips in consequent
=> Can be used to see what products should be sold
with Cookies to promote sale of Potato chips!

Association Rule Discovery: Application 2

 Supermarket shelf management.

 Goal: To identify items that are bought
together by sufficiently many customers.
 Approach: Process the point-of-sale data
collected with barcode scanners to find
dependencies among items.
 A classic rule --
 If a customer buys diaper and milk, then he is very
likely to buy juice.
 So, don’t be surprised if you find six-packs stacked
next to diapers!

Sequential Pattern Discovery: Definition

Given a set of objects, with each object associated with
its own timeline of events, find rules that predict strong
sequential dependencies among different events:

 In point-of-sale transaction sequences,

 Computer Bookstore:

(Intro_To_Visual_C) (C++_Primer) --> (Perl_for_dummies)

 Athletic Apparel Store:

(Shoes) (Racket, Racketball) --> (Sports_Jacket)
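A sequential rule can be checked against each customer's timeline by testing whether the pattern's events occur in the same order. The sketch below is a minimal illustration of that check; the three customer timelines are invented, and the ratio it prints is a confidence-style score that the slides do not define.

```python
def contains_pattern(sequence, pattern):
    """True if each itemset of the pattern is contained in some element of
    the sequence, with the elements appearing in the same order."""
    position = 0
    for wanted in pattern:
        while position < len(sequence) and not wanted <= sequence[position]:
            position += 1
        if position == len(sequence):
            return False
        position += 1
    return True


# Rule from the bookstore example, split into antecedent and full pattern.
antecedent = [{"Intro_To_Visual_C"}, {"C++_Primer"}]
full_pattern = antecedent + [{"Perl_for_dummies"}]

# Hypothetical purchase timelines, one list of transactions per customer.
customers = {
    "c1": [{"Intro_To_Visual_C"}, {"C++_Primer"}, {"Perl_for_dummies"}],
    "c2": [{"Intro_To_Visual_C"}, {"C++_Primer", "Notebook"}],
    "c3": [{"C++_Primer"}, {"Intro_To_Visual_C"}],
}

with_antecedent = [c for c, seq in customers.items() if contains_pattern(seq, antecedent)]
with_rule = [c for c, seq in customers.items() if contains_pattern(seq, full_pattern)]
print(len(with_rule) / len(with_antecedent))
```

Customer c3 bought both books but in the reverse order, so it does not match the antecedent; order is what distinguishes sequential patterns from plain association rules.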

Regression
 Predict a value of a given continuous valued
variable based on the values of other variables,
assuming a linear or nonlinear model of
dependency.
 Greatly studied in statistics, neural network
fields.
 Examples:
 Predicting the age of a person based on
MaritalStatus, NumberOfChildren, Income, …
 E.g., if MaritalStatus = Yes,
Age = 20 + 4*NumberOfChildren + 0.0001*Income + …
 Predicting wind velocities as a function of
temperature, humidity, and pressure.
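For the linear case, the coefficients of such a model can be estimated by least squares. The sketch below fits a line with a single predictor; the data are invented so that they follow Age = 20 + 4*NumberOfChildren exactly, and the income term and the remaining elided terms of the slide's formula are deliberately left out.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical observations constructed to lie exactly on the line.
children = [0, 1, 2, 3]
age = [20, 24, 28, 32]

intercept, slope = fit_line(children, age)
print(intercept, slope)


def predict_age(number_of_children):
    """Predict the continuous target from the fitted coefficients."""
    return intercept + slope * number_of_children
```

Because the invented data contain no noise, the fit recovers the intercept 20 and the slope 4; real data would give estimates that only approximate the underlying relationship.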

Deviation/Anomaly Detection

 Detect significant deviations from normal behavior
 Applications:
 Credit Card Fraud Detection
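One simple way to flag deviations is to measure how many standard deviations each value lies from the mean (a z-score). The slides do not prescribe a method, so the z-score rule, the threshold of 2, and the transaction amounts below are all assumptions made for illustration.

```python
import statistics


def flag_deviations(values, threshold=2.0):
    """Return the values lying more than `threshold` population standard
    deviations away from the mean."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - mean) / spread > threshold]


# Hypothetical card transaction amounts; one is far from the usual range.
amounts = [20, 22, 19, 21, 20, 23, 18, 21, 500]
flagged = flag_deviations(amounts)
print(flagged)
```

A single extreme value inflates both the mean and the standard deviation, so practical fraud detection tends to rely on more robust measures; this sketch only shows the basic idea of comparing behavior against a norm.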

What is Data?
 Collection of data objects and their attributes.
 An attribute is a property or characteristic of an object.
 Examples: eye color of a person, temperature, etc.
 Attribute is also known as variable, field, characteristic, or feature.
 A collection of attributes describes an object.
 Object is also known as record, point, case, sample, entity, or instance.

Attributes (columns) and Objects (rows):
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Types of data sets
 Record
 Data Matrix
 Document Data
 Transaction Data
 Multi-Relational
 Star or snowflake schema
 Graph
 World Wide Web
 Molecular Structures

 Ordered
 Sequential Data
 Spatial Data
 Temporal Data

Record Data
 Data that consists of a collection of records, each
of which consists of a fixed set of attributes
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Data Matrix
 If data objects have the same fixed set of numeric
attributes, then the data objects can be thought of as
points in a multi-dimensional space, where each
dimension represents a distinct attribute

 Such a data set can be represented by an m by n matrix,
where there are m rows, one for each object, and n
columns, one for each attribute.

Projection of x load  Projection of y load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1
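The two objects in the example can be held as a 2 by 5 matrix, here a plain list of rows. The sketch below is one possible representation; the `column` helper is an illustrative name, and treating the rows as points lets the standard-library `math.dist` compute their Euclidean distance.

```python
import math

# Column names as read from the slide's table header.
ATTRIBUTES = ["Projection of x load", "Projection of y load", "Distance", "Load", "Thickness"]

# m = 2 objects (rows), n = 5 numeric attributes (columns).
matrix = [
    [10.23, 5.27, 15.22, 2.7, 1.2],
    [12.65, 6.25, 16.22, 2.2, 1.1],
]
m, n = len(matrix), len(matrix[0])


def column(data, j):
    """All objects' values for attribute j."""
    return [row[j] for row in data]


# Each row is a point in n-dimensional space, so distances are defined.
gap = math.dist(matrix[0], matrix[1])
print(m, n, column(matrix, ATTRIBUTES.index("Thickness")), gap)
```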

Document Data
 Each document becomes a 'term' vector,
 each term is a component (attribute) of the vector,
 the value of each component is the number of times
the corresponding term occurs in the document.

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1  3     0      5     0     2      6     0    2     0        2
Document 2  0     7      0     2     1      0     0    3     0        0
Document 3  0     1      0     0     1      2     2    0     3        0
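Building such a vector amounts to counting each vocabulary term in the document and listing the counts in a fixed order. The sketch below assumes the ten terms of the slide in the order team, coach, play, ball, score, game, win, lost, timeout, season; the sample sentence and the whitespace tokenization are assumptions made for the example.

```python
from collections import Counter

# Assumed vocabulary order; every document vector uses the same order.
VOCABULARY = ["team", "coach", "play", "ball", "score",
              "game", "win", "lost", "timeout", "season"]


def term_vector(text, vocabulary=VOCABULARY):
    """Count each vocabulary term in the text; terms outside the
    vocabulary are ignored and absent terms count as zero."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


# Hypothetical document.
vector = term_vector("Team play team score game game season")
print(vector)
```

Stacking one such vector per document gives a document-term matrix like the one in the table, with one row per document and one column per term.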

Transaction Data
 A special type of record data, where
 each record (transaction) involves a set of items.

 For example, consider a grocery store. The set of
products purchased by a customer during one shopping
trip constitutes a transaction, while the individual
products that were purchased are the items.
TID Items
1 Bread, Coke, Milk
2 Juice, Bread
3 Juice, Coke, Diaper, Milk
4 Juice, Bread, Diaper, Milk
5 Coke, Diaper, Milk

Multi-Relational Data

• Attributes are objects themselves

Graph Data
 Examples: Generic graph and HTML Links
[Figure: generic graph]
<a href="papers/papers.html#bbbb">
Data Mining </a>
<li>
<a href="papers/papers.html#aaaa">
Graph Partitioning </a>
<li>
<a href="papers/papers.html#aaaa">
Parallel Solution of Sparse Linear System of Equations </a>
<li>
<a href="papers/papers.html#ffff">
N-Body Computation and Dense Linear System Solvers </a>
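HTML links become graph data once each link is read as an edge from the page that contains it to the page it points at. The sketch below extracts the link targets with the standard-library `html.parser`; the source page name `index.html` is hypothetical, since the slide does not say which page holds these links.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag encountered."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


PAGE = """
<a href="papers/papers.html#bbbb">Data Mining </a>
<li>
<a href="papers/papers.html#aaaa">Graph Partitioning </a>
<li>
<a href="papers/papers.html#aaaa">Parallel Solution of Sparse Linear System of Equations </a>
<li>
<a href="papers/papers.html#ffff">N-Body Computation and Dense Linear System Solvers </a>
"""

collector = LinkCollector()
collector.feed(PAGE)

# Directed edges (source page, link target); "index.html" is a made-up name.
edges = [("index.html", target) for target in collector.links]
print(edges)
```

Running the same extraction over many pages and merging the edge lists would give a link graph of the kind used to represent the World Wide Web.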

Chemical Data
 Benzene Molecule: C6H6

Ordered Data
 Sequences of transactions
[Figure: a sequence of transactions, labeled "Items/Events" and "An element of the sequence"]

Ordered Data
 Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG

Ordered Data

 Spatio-Temporal Data
[Figure: Average Monthly Temperature of land and ocean]
