Knowledge Discovery and Data Mining (KDD)
Affordable computing
Competitive pressure
gain an edge by providing improved, customized services
information as a product in its own right
Selection
Data mining
Business Objective Determination
Data Preparation
Data Mining
Analysis of Results and Knowledge Assimilation
high dimensionality
Overfitting
models noise in training data, rather than just the general patterns
Understandability of patterns
Data Mining
Prediction Methods
using some variables to predict unknown or future values of
other variables
Descriptive Methods
finding human-interpretable patterns describing the data
Classification
Clustering
Association Rule Discovery
Sequential Pattern Discovery
Regression
Deviation Detection
Classification
Data defined in terms of attributes, one of which is the class
Classification: Example
Clustering
Given a set of data points, each having a set of
attributes, and a similarity measure among them,
find clusters such that
data points in one cluster are more similar to one another
data points in separate clusters are less similar to one another
Similarity measures
Euclidean distance if attributes are continuous
Problem specific measures
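For continuous attributes, the Euclidean distance named above can be sketched in a few lines of Python. The function name and the sample points are illustrative assumptions, not taken from the slides; a smaller distance means the two points are more similar.

```python
import math

def euclidean_distance(p, q):
    """Euclidean distance between two points with continuous attributes."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of attributes")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical 2-D data points
a = (0.0, 0.0)
b = (3.0, 4.0)
print(euclidean_distance(a, b))  # 5.0
```

A problem-specific measure would replace this function while leaving the rest of a clustering procedure unchanged.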
Association Rules: Application
Marketing and Sales Promotion:
Consider discovered rule:
{Bagels, } --> {Potato Chips}
Potato Chips as consequent: can be used to determine
what may be done to boost sales
Bagels as an antecedent: can be used to see which
products may be affected if bagels are discontinued
Can be used to see which products should be sold with
Bagels to promote sale of Potato Chips
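The slides do not say how a discovered rule is scored. Support and confidence are the usual measures, and the sketch below computes both. It rests on several assumptions: the baskets are hypothetical, the function names are illustrative, and the antecedent is simplified to Bagels alone because the rest of the antecedent in the rule above is not shown.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_ante = sum(1 for t in transactions if antecedent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    return n_both / n_ante if n_ante else 0.0

# Hypothetical market baskets (not data from the slides)
baskets = [
    {"Bagels", "Potato Chips"},
    {"Bagels", "Potato Chips", "Milk"},
    {"Bagels", "Milk"},
    {"Potato Chips"},
]

print(support(baskets, {"Bagels", "Potato Chips"}))       # 0.5
print(confidence(baskets, {"Bagels"}, {"Potato Chips"}))  # 2 of 3 bagel baskets
```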
Visualization
complement to other DM techniques like Segmentation, etc.
Data Mining
Campaign Management
Predictive
Clustering
Classification
Association
Decision Tree
Sequential Analysis
Rule Induction
Neural Networks
Nearest Neighbor Classification
Regression
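[Figure: classification example with two classes, Honest and Crooked. People shown: Tridas, Vickie, Mike, Wally, Waldo, Barney.]
Prediction
[Figure: prediction example. People shown: Tridas, Vickie, Mike.]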
Decision Trees
Data
height  hair   eyes   class
short   blond  blue   A
tall    blond  brown  B
tall    red    blue   A
short   dark   blue   B
tall    dark   blue   B
tall    blond  blue   A
tall    dark   brown  B
short   blond  brown  B
Split on hair:
  dark
    short, blue = B
    tall, blue = B
    tall, brown = B
  red
    tall, blue = A
  blond
    short, blue = A
    tall, brown = B
    tall, blue = A
    short, brown = B
Blond branch, split on eyes:
  blue
    short = A
    tall = A
  brown
    tall = B
    short = B
Decision Trees: Learned Predictive Rules
hair = dark: class B
hair = red: class A
hair = blond: test eyes
  eyes = blue: class A
  eyes = brown: class B
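A learned tree of this shape can be written directly as nested conditionals. The sketch below assumes the eight-row height/hair/eyes data set from the Decision Trees slide; the class labels at the leaves are read off that data, and the function name `predict` and the tuple layout are illustrative choices rather than anything the slides define.

```python
# (height, hair, eyes, class) rows from the decision tree example
ROWS = [
    ("short", "blond", "blue",  "A"),
    ("tall",  "blond", "brown", "B"),
    ("tall",  "red",   "blue",  "A"),
    ("short", "dark",  "blue",  "B"),
    ("tall",  "dark",  "blue",  "B"),
    ("tall",  "blond", "blue",  "A"),
    ("tall",  "dark",  "brown", "B"),
    ("short", "blond", "brown", "B"),
]

def predict(height, hair, eyes):
    """Classify one record with the hair-then-eyes tree.
    Height is accepted but never tested by this tree."""
    if hair == "dark":
        return "B"
    if hair == "red":
        return "A"
    if hair == "blond":
        return "A" if eyes == "blue" else "B"
    raise ValueError(f"unseen hair value: {hair!r}")

# The tree reproduces the class of every training row.
errors = [row for row in ROWS if predict(*row[:3]) != row[3]]
print(len(errors))  # 0
```

Note that height never appears in the tree: hair and eyes are enough to separate the two classes on this data.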
Decision Trees: Another Example
[Figure: decision tree over a total list with 50% members, split on number of children (0-1, 2-3, 4+), income ($20-50k, $50-75k, $75k+), and age (20-40, 40-60). Membership rates shown at the nodes: 15%, 20%, 45%, 70%, 80%, 85%.]
Rule Induction
Try to find rules of the form
IF <left-hand-side> THEN <right-hand-side>
This is the reverse of a rule-based agent, where the rules
are given and the agent must act. Here the actions are
given and we have to discover the rules!
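As a rough illustration of discovering IF/THEN rules from data, the sketch below searches the height/hair/eyes example for single-condition rules that hold without exception. This is a naive exhaustive search written for illustration, not a specific published rule-induction algorithm; the function name and rule format are assumptions.

```python
from collections import defaultdict

ATTRS = ("height", "hair", "eyes")
# (height, hair, eyes, class) rows from the decision tree example
ROWS = [
    ("short", "blond", "blue",  "A"),
    ("tall",  "blond", "brown", "B"),
    ("tall",  "red",   "blue",  "A"),
    ("short", "dark",  "blue",  "B"),
    ("tall",  "dark",  "blue",  "B"),
    ("tall",  "blond", "blue",  "A"),
    ("tall",  "dark",  "brown", "B"),
    ("short", "blond", "brown", "B"),
]

def induce_rules(rows, attrs):
    """Return sorted (attr, value, cls, n_covered) tuples for every
    attribute value whose rows all share a single class."""
    classes_seen = defaultdict(set)
    covered = defaultdict(int)
    for *values, cls in rows:
        for attr, value in zip(attrs, values):
            classes_seen[(attr, value)].add(cls)
            covered[(attr, value)] += 1
    rules = []
    for (attr, value), classes in classes_seen.items():
        if len(classes) == 1:  # no exceptions in the data
            rules.append((attr, value, next(iter(classes)), covered[(attr, value)]))
    return sorted(rules)

for attr, value, cls, n in induce_rules(ROWS, ATTRS):
    print(f"IF {attr} = {value} THEN class = {cls}  (covers {n} rows)")
# IF eyes = brown THEN class = B  (covers 3 rows)
# IF hair = dark THEN class = B  (covers 3 rows)
# IF hair = red THEN class = A  (covers 1 rows)
```

Blond-haired rows are not covered by any single-condition rule, which is why the decision tree needs a second test on eyes for that branch.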
Coupons, discounts
Product placement
Timing of cross-marketing
Discovery of patterns
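[Table: item co-occurrence counts, upper triangle only. The first item's label did not survive extraction.]
            (item 1)   ice cream   pet food
(item 1)    1891       685         24
ice cream   -----      1088        322
pet food    -----      -----       2451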
Clustering
The art of finding groups in data
Objective: gather items from a database into sets
according to (unknown) common characteristics
Much more difficult than classification since the
classes are not known in advance (no training)
Technique: unsupervised learning
[Figure: scatter plots on 0-10 axes illustrating the steps below for K=2]
Arbitrarily choose K objects as initial cluster centers
Assign each of the objects to the most similar center
Update the cluster means
Reassign, then update the cluster means again
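The choose/assign/update/reassign loop illustrated here is the procedure commonly known as k-means. Below is a minimal pure-Python sketch under stated assumptions: "arbitrarily choose" is implemented as taking the first K points so that the run is reproducible, similarity is Euclidean distance, the loop stops once no assignment changes, and the sample points are hypothetical.

```python
import math

def kmeans(points, k, max_iter=100):
    """Cluster points into k groups; returns (centers, assignment)."""
    centers = [tuple(p) for p in points[:k]]  # arbitrary initial centers
    assignment = None
    for _ in range(max_iter):
        # Assign each object to the most similar (nearest) center.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:  # nothing was reassigned: stop
            break
        assignment = new_assignment
        # Update each cluster mean from its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centers, assignment

# Hypothetical 2-D points forming two visible groups
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centers, assignment = kmeans(points, k=2)
print(assignment)  # [0, 0, 0, 1, 1, 1]
```

With a different arbitrary starting choice the procedure can settle on different clusters, so practical use often tries several starts.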
Opinion Analysis
Word-of-mouth on the Web
The Web has dramatically changed the way that
consumers express their opinions.
One can post reviews of products at merchant
sites, Web forums, discussion groups, blogs
Techniques are being developed to exploit these
sources.
Benefits of Review Analysis
Potential Customer: No need to read many reviews
Product manufacturer: market intelligence, product
benchmarking
Introduction
Two main types of textual information.
Facts and Opinions
Note: factual statements can imply opinions too.
Introduction: user-generated media
Importance of opinions:
A Fascinating Problem!
An Example Review
I bought an iPhone a few days ago. It was such a
nice phone. The touch screen was really cool. The
voice quality was clear too. Although the battery life
was not long, that is ok for me. However, my mother
was mad with me as I did not tell her before I bought
the phone. She also thought the phone was too
expensive, and wanted me to return it to the shop.
What do we see?
Opinions, targets of opinions, and opinion holders
An opinion is a quintuple
(o_j, f_jk, so_ijkl, h_i, t_l),
where
o_j is a target object.
f_jk is a feature of the object o_j.
so_ijkl is the sentiment value of the opinion of the opinion holder h_i
on feature f_jk of object o_j at time t_l. so_ijkl is +ve, -ve, or neu, or a
more granular rating.
h_i is an opinion holder.
t_l is the time when the opinion is expressed.
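The quintuple maps naturally onto a small record type. The sketch below encodes a few opinions from the example iPhone review. The class and field names are illustrative assumptions, the extracted tuples are a manual reading of the review rather than the output of any algorithm, and the time is left as None because the review gives no date.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Opinion:
    target: str          # o_j: the object being discussed
    feature: str         # f_jk: feature of the object
    sentiment: str       # so_ijkl: "+ve", "-ve", or "neu"
    holder: str          # h_i: who holds the opinion
    time: Optional[str]  # t_l: when it was expressed (None if unknown)

# Manual reading of the example review (an assumption, not algorithm output)
opinions = [
    Opinion("iPhone", "touch screen",  "+ve", "reviewer", None),
    Opinion("iPhone", "voice quality", "+ve", "reviewer", None),
    Opinion("iPhone", "battery life",  "-ve", "reviewer", None),
    Opinion("iPhone", "price",         "-ve", "reviewer's mother", None),
]

# Structured tuples can then be summarized, e.g. sentiment counts per holder.
summary = Counter((o.holder, o.sentiment) for o in opinions)
print(summary[("reviewer", "+ve")])  # 2
```

Once opinions are in this form, per-feature summaries and product comparisons reduce to ordinary grouping and counting.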
Structured Data
Subjectivity Analysis
(Wiebe et al 2004)
Negative: 6
The screen is easily scratched.
I have a lot of difficulty in removing
finger marks from the touch screen.
[Figure: summary of reviews of Cell Phone 1, and comparison of reviews of Cell Phone 1 and Cell Phone 2, by feature: Voice, Screen, Battery, Size, Weight]