Lecture 7 - Introduction To Data Mining
5/3/2021
Introduction
• Data is growing at a phenomenal rate
• Users expect more sophisticated information
• How?
• Database
– Data: operational data
– Output: precise; a subset of the database
• Data Mining
– Data: not operational data
– Output: fuzzy; not a subset of the database
Query Examples
• Database
– Find all credit applicants with the last name Smith.
– Identify customers who have purchased more than $10,000 in the last month.
– Find all customers who have purchased milk.
• Data Mining
– Find all credit applicants who are poor credit risks. (Classification)
– Identify customers with similar buying habits. (Clustering)
– Find all items that are frequently purchased with milk. (Association rules)
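The difference between the two kinds of query can be sketched in a few lines of Python. This is only an illustration: the purchase records, the customer names, and the `min_count` threshold are all made up and are not part of the lecture.

```python
from collections import Counter

# Hypothetical purchase records (customer, item); values are illustrative only.
purchases = [
    ("ann", "milk"), ("ann", "bread"),
    ("bob", "milk"), ("bob", "bread"), ("bob", "eggs"),
    ("cat", "beer"),
    ("dan", "milk"), ("dan", "eggs"),
]

# Database-style query: a precise predicate whose answer is a subset of the stored rows.
milk_buyers = sorted({customer for customer, item in purchases if item == "milk"})

# Mining-style question: which items are frequently purchased with milk?
# The answer is derived from co-occurrence counts, not read directly from a row.
baskets = {}
for customer, item in purchases:
    baskets.setdefault(customer, set()).add(item)

co_counts = Counter()
for items in baskets.values():
    if "milk" in items:
        co_counts.update(items - {"milk"})

min_count = 2  # assumed threshold for "frequently"
frequent_with_milk = sorted(item for item, n in co_counts.items() if n >= min_count)
```

The first answer is exact. The second depends on a chosen threshold, which is one reason data mining output is described as fuzzy.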
CLASSIFICATION
• Assign data into predefined groups or classes.
Decision tree on score x:
x >= 90: A
x < 90:
  x >= 80: B
  x < 80:
    x >= 70: C
    x < 70:
      x >= 60: D
      x < 60: F
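The grade tree can be read as a chain of threshold tests, sketched below in Python. One assumption is made: the final split is taken at 60 on both sides, so every score below 60 maps to F. The function name `assign_grade` is hypothetical.

```python
def assign_grade(score: float) -> str:
    """Walk the threshold tree from the top; each test peels off one class."""
    if score >= 90:
        return "A"
    if score >= 80:
        return "B"
    if score >= 70:
        return "C"
    if score >= 60:  # assumed cut-off between D and F
        return "D"
    return "F"


# Classify a few sample scores, one per predefined class.
grades = [assign_grade(s) for s in (95, 85, 75, 65, 55)]
```

The classes (A to F) are fixed in advance. The classifier only decides which one each score belongs to.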
[Figure: Grasshoppers and Katydids plotted by Antenna Length (vertical axis, 1 to 10) against Abdomen Length (horizontal axis, 1 to 10)]
Facial Recognition
Handwriting Recognition
[Figure: George Washington Manuscript]
Anomaly Detection
CLUSTERING
• Partition data into previously undefined groups.
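As a rough sketch of partitioning data into groups that are not defined beforehand, the Python below runs k-means (Lloyd's algorithm), which this lecture lists by name among related techniques. The points and the choice of starting centroids are made up for illustration, and the function name `k_means` is hypothetical.

```python
def k_means(points, centroids, max_iters=100):
    """Lloyd's algorithm on 2-D points; returns (centroids, labels)."""
    centroids = list(centroids)
    labels = []
    for _ in range(max_iters):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(
                range(len(centroids)),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    (
                        sum(x for x, _ in members) / len(members),
                        sum(y for _, y in members) / len(members),
                    )
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # no centroid moved, so the partition is stable
        centroids = new_centroids
    return centroids, labels


# Illustrative, made-up points forming two loose groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = k_means(points, centroids=[points[0], points[3]])
```

Unlike classification, no labels are supplied. The group numbers in `labels` only say which points ended up together.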
What is Similarity?
• Hierarchical
• Partitional
[Figure: Iris species: Setosa, Versicolor, Virginica]
http://www.time.com/time/magazine/article/0,9171,1541283,00.html
ASSOCIATION RULES/LINK ANALYSIS
• Find relationships between data
ASSOCIATION RULES EXAMPLES
• People who buy diapers also buy beer
• If gene A is highly expressed in this disease then gene B is also expressed
• Relationships between people
• Book Stores
• Department Stores
• Advertising
• Product Placement
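A rule such as "people who buy diapers also buy beer" is usually scored with two standard measures, support and confidence. The sketch below computes both in Python. The baskets are hypothetical and the counts are made up, and the function names `support` and `confidence` are chosen here for illustration.

```python
# Hypothetical market baskets; the items and counts are made up for illustration.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= b for b in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimated P(consequent | antecedent): support(both) / support(antecedent)."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


rule_support = support({"diapers", "beer"}, baskets)          # 2 of 5 baskets
rule_confidence = confidence({"diapers"}, {"beer"}, baskets)  # 2 of the 3 diaper baskets
```

Support says how often the items appear together at all. Confidence says how often the rule holds when its left-hand side is present.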
Data Mining Introductory and Advanced Topics, by Margaret H. Dunham, Prentice Hall, 2003.
DILBERT reprinted by permission of United Feature Syndicate, Inc.
No/Little Cheating
Rampant Cheating
Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.
KDD Process
Related Topics
• Databases
• OLTP
• OLAP
• Information Retrieval
Classification/Prediction is Fuzzy
[Figure: Accept regions under Simple and Fuzzy classification]
Information Retrieval
• Information Retrieval (IR): retrieving desired information
from textual data.
• Library Science
• Digital Libraries
• Web Search Engines
• Traditionally keyword based
• Sample query:
Find all documents about “data mining”.
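Keyword-based retrieval for the sample query can be sketched as follows in Python. The document collection is hypothetical, and `tokenize` and `keyword_search` are names chosen for this illustration rather than any library's API.

```python
import re

# Hypothetical document collection; ids and text are made up for illustration.
documents = {
    "doc1": "An introduction to data mining and knowledge discovery",
    "doc2": "Mining engineering: safety data for underground sites",
    "doc3": "Relational databases and transaction processing",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def keyword_search(query, documents):
    """Return ids of documents that contain every query keyword."""
    terms = set(tokenize(query))
    return sorted(
        doc_id for doc_id, text in documents.items() if terms <= set(tokenize(text))
    )


hits = keyword_search("data mining", documents)
```

Both `doc1` and `doc2` match, although `doc2` is about mining engineering. Pure keyword matching retrieves documents that contain the words, which is not the same as documents about the topic.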
IR Classification
OLAP
• Online Analytical Processing (OLAP): supports more complex queries than OLTP.
• Online Transaction Processing (OLTP): traditional database/transaction processing.
• Dimensional data; cube view
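• Visualization of operations:
– Slice: fix one dimension at a single value to examine a sub-cube.
– Dice: select a sub-cube by restricting two or more dimensions.
– Pivot: rotate the cube to look at another dimension.
– Roll Up/Drill Down: aggregate to a coarser level or expand to a finer one.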
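Two of the cube operations, slice and roll up, can be sketched on a small in-memory cube in Python. The sketch assumes the common convention that a slice fixes one dimension at a single value and a roll up sums a dimension away. The sales figures, dimension names, and function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical sales cube: (product, region, quarter) -> units sold. Made-up values.
cube = {
    ("milk", "north", "Q1"): 10, ("milk", "north", "Q2"): 12,
    ("milk", "south", "Q1"): 7,  ("milk", "south", "Q2"): 9,
    ("beer", "north", "Q1"): 4,  ("beer", "north", "Q2"): 6,
    ("beer", "south", "Q1"): 8,  ("beer", "south", "Q2"): 5,
}
DIMS = ("product", "region", "quarter")


def slice_cube(cube, dim, value):
    """Slice: keep only the cells where one dimension equals a fixed value."""
    i = DIMS.index(dim)
    return {key: v for key, v in cube.items() if key[i] == value}


def roll_up(cube, dim):
    """Roll up: sum one dimension away, leaving a coarser summary."""
    i = DIMS.index(dim)
    totals = defaultdict(int)
    for key, v in cube.items():
        totals[key[:i] + key[i + 1:]] += v
    return dict(totals)


q1 = slice_cube(cube, "quarter", "Q1")         # sub-cube for the first quarter
by_product_region = roll_up(cube, "quarter")   # totals over both quarters
```

Drilling down is the reverse direction: going from `by_product_region` back to the per-quarter cells in `cube`.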
• Bayes Theorem
• Regression Analysis
• EM Algorithm
• K-Means Clustering
• Time Series Analysis
• Algorithm Design Techniques
• Algorithm Analysis
• Data Structures
• Neural Networks
• Decision Tree Algorithms
KDD Issues
• Human Interaction
• Overfitting
• Outliers
• Interpretation
• Visualization
• Large Datasets
• High Dimensionality
Overfitting
• Suppose we want to predict whether an individual is short,
medium, or tall. What is wrong with this data?
Name       Gender  Height (m)  Output
Mary       F       1.6         Short
Maggie     F       1.9         Medium
Martha     F       1.88        Medium
Stephanie  F       1.7         Short
Bob        M       1.85        Medium
Kathy      F       1.6         Short
George     M       1.7         Short
Debbie     F       1.8         Medium
Todd       M       1.95        Medium
Kim        F       1.9         Medium
Amy        F       1.8         Medium
Wynette    F       1.75        Medium
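One possible reading of the question, offered as an interpretation and not stated on the slide, is that nobody in the table is labeled Tall. A model fitted closely to these rows can therefore never predict that class. The Python sketch below uses the table's rows with a 1-nearest-neighbour rule. Heights are assumed to be in metres, and the 2.10 test value is hypothetical.

```python
# Training data copied from the table: (height, label). Heights assumed to be in metres.
training = [
    (1.6, "Short"), (1.9, "Medium"), (1.88, "Medium"), (1.7, "Short"),
    (1.85, "Medium"), (1.6, "Short"), (1.7, "Short"), (1.8, "Medium"),
    (1.95, "Medium"), (1.9, "Medium"), (1.8, "Medium"), (1.75, "Medium"),
]


def predict(height):
    """1-nearest-neighbour: copy the label of the closest training height."""
    return min(training, key=lambda row: abs(row[0] - height))[1]


# The only classes the model has ever seen.
seen_labels = {label for _, label in training}

# Hypothetical new individual, well above every training height.
prediction_for_2_10 = predict(2.10)
```

The rule reproduces every training label, yet it labels a 2.10 individual Medium. It fits the sample well and generalizes poorly beyond it.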
WARNING
• With data mining you don’t always know what you are looking for.
• There is not one right answer.
• The data you are using is noisy.
• Data mining is a very applied discipline.
• A data mining course gives you tools for analyzing data.
• Experience teaches you how to use these tools.
http://ieeexplore.ieee.org/iel5/6/32236/01502526.pdf?tp=&arnumber=1502526&isnumber=32236
Social Implications of DM
• Privacy
• Profiling
• Unauthorized use
• Invalid results and claims
Visualization Techniques
• Graphical
• Geometric
• Icon-based
• Pixel-based
• Hierarchical
• Hybrid
DM Tools
• XLMiner – Easy-to-use add-in for Excel
http://www.solver.com/xlminer/index.html
• Weka – Open Source; Visualization, Functionality, Interface
http://www.cs.waikato.ac.nz/ml/weka/
• SAS (JMP) – Commercial Product
• SPSS – Commercial Product
• MATLAB – Statistical/Math Applications
• R – Programming
Thank you for your attention!