Lecture 7 - Introduction To Data Mining

The lecture slide about introduction to data mining provides an overview of the techniques and tools used to extract meaningful insights and knowledge from large datasets.
5/3/2021

Introduction to Data Mining


Introduction
• Data is growing at a phenomenal rate
• Users expect more sophisticated information
• How?

UNCOVER HIDDEN INFORMATION


DATA MINING

Data Mining Definition


• Finding hidden information in a database
• Fit data to a model
• Similar terms
  – Exploratory data analysis
  – Data driven discovery
  – Deductive learning


Data Mining Algorithm


• Objective: Fit Data to a Model
  – Descriptive
  – Predictive
• Preference – Technique to choose the best model
• Search – Technique to search the data
  – “Query”

Database Processing vs. Data Mining Processing

          Database Processing            Data Mining Processing
Query     Well defined; SQL              Poorly defined; no precise query language
Data      Operational data               Not operational data
Output    Precise; subset of database    Fuzzy; not a subset of database


Query Examples
• Database
– Find all credit applicants with last name of Smith.
– Identify customers who have purchased more than $10,000 in the last month.
– Find all customers who have purchased milk.
• Data Mining
– Find all credit applicants who are poor credit risks. (classification)
– Identify customers with similar buying habits. (clustering)
– Find all items which are frequently purchased with milk. (association rules)

Basic Data Mining Tasks


• Classification maps data into predefined groups or
classes
• Supervised learning
• Prediction
• Regression
• Clustering groups similar data together into
clusters.
• Unsupervised learning
• Segmentation
• Partitioning


Basic Data Mining Tasks (cont’d)


• Link Analysis uncovers relationships among data.
• Affinity Analysis
• Association Rules
• Sequential Analysis determines sequential patterns.

CLASSIFICATION
• Assign data into predefined groups or classes.


But it isn’t Magic


• You must know what you are looking for
• You must know how to look for it

Suppose you knew that a specific cave had gold:


What would you look for?
How would you look for it?
Might need an expert miner


“If it looks like a duck, walks like a duck, and quacks like a duck, then it’s a duck.”
“If it looks like a terrorist, walks like a terrorist, and quacks like a terrorist, then it’s a terrorist.”

Description       Behavior       Associations
Classification    Clustering     Link Analysis
(Profiling)       (Similarity)

Classification Ex: Grading


x >= 90: A
x < 90:
    x >= 80: B
    x < 80:
        x >= 70: C
        x < 70:
            x >= 60: D
            x < 60: F
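The grading tree can be read as a chain of threshold tests, each one peeling off a single class. The following is a minimal Python sketch of that reading, not the lecture's own code; the function name `grade` is illustrative, and it assumes the last split is at 60 so that every score lands in exactly one class.

```python
def grade(x: float) -> str:
    """Walk the grading tree from the root; each test separates one class."""
    if x >= 90:
        return "A"
    if x >= 80:
        return "B"
    if x >= 70:
        return "C"
    if x >= 60:  # assumed threshold for the final split
        return "D"
    return "F"


if __name__ == "__main__":
    for score in (95, 85, 72, 60, 41):
        print(score, grade(score))
```

Because the classes (A to F) are fixed in advance, this is classification in the sense used in the lecture: the tree only decides which predefined class a score belongs to.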

Given a collection of annotated data (in this case five instances of Katydids and five of Grasshoppers), decide what type of insect the unlabeled example is.

[Figure: labeled examples of Katydids and Grasshoppers]


The classification problem can now be expressed as: Given a training database, predict the class label of a previously unseen instance.

Insect ID   Abdomen Length   Antennae Length   Insect Class
1           2.7              5.5               Grasshopper
2           8.0              9.1               Katydid
3           0.9              4.7               Grasshopper
4           1.1              3.1               Grasshopper
5           5.4              8.5               Katydid
6           2.9              1.9               Grasshopper
7           6.1              6.6               Katydid
8           0.5              1.0               Grasshopper
9           8.3              6.6               Katydid
10          8.1              4.7               Katydid

Previously unseen instance:
11          5.1              7.0               ???????
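One simple way to attack this problem is a k-nearest-neighbour vote. The slides do not name a classifier at this point, so the technique is brought in here purely as an illustration; the function name `knn_predict` and the choice k = 3 are assumptions. The training rows are copied from the insect table.

```python
from collections import Counter
from math import dist

# Training rows from the slide: (abdomen length, antennae length, class).
TRAINING = [
    (2.7, 5.5, "Grasshopper"),
    (8.0, 9.1, "Katydid"),
    (0.9, 4.7, "Grasshopper"),
    (1.1, 3.1, "Grasshopper"),
    (5.4, 8.5, "Katydid"),
    (2.9, 1.9, "Grasshopper"),
    (6.1, 6.6, "Katydid"),
    (0.5, 1.0, "Grasshopper"),
    (8.3, 6.6, "Katydid"),
    (8.1, 4.7, "Katydid"),
]


def knn_predict(abdomen: float, antennae: float, k: int = 3) -> str:
    """Return the majority class among the k nearest training rows."""
    nearest = sorted(
        TRAINING,
        key=lambda row: dist((abdomen, antennae), row[:2]),  # Euclidean distance
    )[:k]
    votes = Counter(label for _, _, label in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    print(knn_predict(5.1, 7.0))  # insect 11
```

For insect 11 the three closest training rows are insects 7, 5, and 1, so two of the three votes go to Katydid. A different classifier, or a different distance measure, could reasonably give a different answer near the class boundary.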


[Figure: Antenna Length (vertical axis, 1-10) vs. Abdomen Length (horizontal axis, 1-10) for Grasshoppers and Katydids]


Facial Recognition


Handwriting Recognition

[Figure: George Washington Manuscript]


Anomaly Detection


CLUSTERING
• Partition data into previously undefined groups.


What is Similarity?


Two Types of Clustering

Hierarchical Partitional
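As an illustration of the partitional style, here is a minimal k-means sketch in plain Python. K-means is named elsewhere in the lecture, but this particular implementation, the function name `kmeans`, and the choice of the first two points as starting centroids are assumptions made for the example. The points are the insect measurements from the classification example with the class labels removed.

```python
from math import dist

Point = tuple[float, float]

# Insect measurements (abdomen length, antennae length), labels removed.
INSECTS: list[Point] = [
    (2.7, 5.5), (8.0, 9.1), (0.9, 4.7), (1.1, 3.1), (5.4, 8.5),
    (2.9, 1.9), (6.1, 6.6), (0.5, 1.0), (8.3, 6.6), (8.1, 4.7),
]


def kmeans(points: list[Point], centroids: list[Point], max_iter: int = 100) -> list[int]:
    """Return, for each point, the index of the cluster it ends up in."""
    centroids = list(centroids)  # work on a copy of the starting centroids
    assignment: list[int] = []
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the partition is stable
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return assignment


if __name__ == "__main__":
    print(kmeans(INSECTS, centroids=INSECTS[:2]))
```

Unlike classification, nothing tells the algorithm what the groups mean; it only partitions the points. The result also depends on the starting centroids, which is one reason the groups are described as previously undefined.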


Hierarchical Clustering Example


Iris Data Set

Versicolor

Setosa Virginica


http://www.time.com/time/magazine/article/0,9171,1541283,00.html


Microarray Data Analysis


• Each probe location associated with gene
• Color indicates degree of gene expression
• Compare different samples (normal/disease)
• Track same sample over time
• Questions
• Which genes are related to this disease?
• Which genes behave in a similar manner?
• What is the function of a gene?
• Clustering
• Hierarchical
• K-means


Microarray Data - Clustering

"Gene expression profiling identifies clinically relevant subtypes of prostate cancer", Proc. Natl. Acad. Sci. USA, Vol. 101, Issue 3, 811-816, January 20, 2004


ASSOCIATION RULES/
LINK ANALYSIS
• Find relationships between data


ASSOCIATION RULES
EXAMPLES
• People who buy diapers also buy beer
• If gene A is highly expressed in this disease then gene B is also expressed
• Relationships between people
• Book Stores
• Department Stores
• Advertising
• Product Placement
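A rule such as "people who buy diapers also buy beer" is usually judged by its support and confidence. Those two measures are not defined on this slide; they are the standard ones for association rules and are introduced here only for illustration. The baskets below are invented, and the helper names `count`, `support`, and `confidence` are hypothetical.

```python
# Hypothetical market baskets, invented for illustration only.
BASKETS = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"milk", "bread"},
    {"diapers", "beer", "bread"},
]


def count(itemset: set[str]) -> int:
    """Number of baskets that contain every item in the itemset."""
    return sum(itemset <= basket for basket in BASKETS)


def support(itemset: set[str]) -> float:
    """Fraction of all baskets that contain the itemset."""
    return count(itemset) / len(BASKETS)


def confidence(antecedent: set[str], consequent: set[str]) -> float:
    """Among baskets containing the antecedent, the fraction that also
    contain the consequent."""
    return count(antecedent | consequent) / count(antecedent)


if __name__ == "__main__":
    print("support:", support({"diapers", "beer"}))
    print("confidence:", confidence({"diapers"}, {"beer"}))
```

In this toy data, three of the five baskets contain both items and three of the four diaper baskets also contain beer. An association rule algorithm searches for all rules whose support and confidence exceed chosen thresholds rather than checking one rule by hand.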


Data Mining Introductory and Advanced Topics, by Margaret H. Dunham, Prentice Hall, 2003.
DILBERT reprinted by permission of United Feature Syndicate, Inc.


Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.


No/Little Cheating

Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.


Rampant Cheating

Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.


Jialun Qin, Jennifer J. Xu, Daning Hu, Marc Sageman and Hsinchun Chen, “Analyzing Terrorist Networks: A Case Study of the Global Salafi Jihad Network,” Lecture Notes in Computer Science, Publisher: Springer GmbH, Volume 3495 / 2005, p. 287.


Ex: Stock Market Analysis


• Example: Stock Market
• Predict future values
• Determine similar patterns over time
• Classify behavior


Ex: Stock Market Analysis


Data Mining vs. KDD


• Knowledge Discovery in Databases (KDD): process
of finding useful information and patterns in data.
• Data Mining: Use of algorithms to extract the
information and patterns derived by the KDD
process.


KDD Process

Modified from [FPSS96C]


• Selection: Obtain data from various sources.
• Preprocessing: Cleanse data.
• Transformation: Convert to common format.
Transform to new format.
• Data Mining: Obtain desired results.
• Interpretation/Evaluation: Present results to user in
meaningful manner.

KDD Process Ex: Web Log


• Selection:
  – Select log data (dates and locations) to use
• Preprocessing:
  – Remove identifying URLs; Remove error logs
• Transformation:
  – Sessionize (sort and group)
• Data Mining:
  – Identify and count patterns; Construct data structure
• Interpretation/Evaluation:
  – Identify and display frequently accessed sequences.
• Potential User Applications:
  – Cache prediction
  – Personalization
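The web log steps can be sketched end to end on a toy log. Everything in the example is hypothetical: the record layout (user, timestamp, URL, status), the sample rows, the helper names, and the simplification that each user has exactly one session. It is meant only to show how the stages hand data to one another.

```python
from collections import Counter, defaultdict

# Hypothetical log records: (user, timestamp, url, HTTP status).
RAW_LOG = [
    ("u1", 1, "/home", 200),
    ("u1", 2, "/products", 200),
    ("u2", 1, "/home", 200),
    ("u1", 3, "/cart", 200),
    ("u2", 2, "/missing", 404),
    ("u2", 3, "/products", 200),
    ("u3", 1, "/home", 200),
    ("u3", 2, "/products", 200),
]


def sessionize(log):
    """Preprocess and transform: drop error entries, then sort and group."""
    clean = [record for record in log if record[3] < 400]  # remove error logs
    sessions = defaultdict(list)
    for user, _, url, _ in sorted(clean):  # sorts by user, then timestamp
        sessions[user].append(url)
    return dict(sessions)


def frequent_pairs(sessions, min_count=2):
    """Data mining: count consecutive page pairs and keep the frequent ones."""
    counts = Counter()
    for pages in sessions.values():
        counts.update(zip(pages, pages[1:]))
    return {pair: n for pair, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    sessions = sessionize(RAW_LOG)
    for pair, n in frequent_pairs(sessions).items():
        print(pair, n)  # interpretation: display frequently accessed sequences
```

A frequently accessed sequence such as a move from one page to the next is the kind of pattern that cache prediction or personalization could act on.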


Related Topics
• Databases
• OLTP
• OLAP
• Information Retrieval


DB & OLTP Systems


• Schema
• (ID,Name,Address,Salary,JobNo)
• Data Model
• ER
• Relational
• Transaction
• Query:
SELECT Name
FROM T
WHERE Salary > 100000

DM: Only imprecise queries
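To see what "precise" means in practice, the slide's query can be run against a small in-memory table with Python's standard `sqlite3` module. The rows, the column types, and the function name `high_earners` are assumptions made for the example; only the schema columns and the query itself come from the slide.

```python
import sqlite3

# Hypothetical rows for the slide's schema (ID, Name, Address, Salary, JobNo).
ROWS = [
    (1, "Ana", "12 Elm St", 120000, 7),
    (2, "Ben", "4 Oak Ave", 85000, 3),
    (3, "Chen", "9 Pine Rd", 150000, 7),
]


def high_earners(rows):
    """Load rows into an in-memory table T and run the slide's query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE T (ID INTEGER PRIMARY KEY, Name TEXT, "
            "Address TEXT, Salary REAL, JobNo INTEGER)"
        )
        conn.executemany("INSERT INTO T VALUES (?, ?, ?, ?, ?)", rows)
        # ORDER BY is added here only to make the output order predictable.
        cursor = conn.execute(
            "SELECT Name FROM T WHERE Salary > 100000 ORDER BY Name"
        )
        return [name for (name,) in cursor]
    finally:
        conn.close()


if __name__ == "__main__":
    print(high_earners(ROWS))
```

The answer is an exact subset of the stored rows and is the same every time for the same data, which is the contrast with a data mining request such as "find employees who resemble our top performers".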


Classification/Prediction is Fuzzy

[Figure: two panels, "Simple" and "Fuzzy", each showing Reject and Accept regions against Loan Amnt]


Information Retrieval
• Information Retrieval (IR): retrieving desired information
from textual data.
• Library Science
• Digital Libraries
• Web Search Engines
• Traditionally keyword based
• Sample query:
Find all documents about “data mining”.

DM: Similarity measures; Mine text/Web data.

Information Retrieval (cont’d)


• Similarity: measure of how close a query is to a document.
• Documents which are “close enough” are retrieved.
• Metrics:
  – Precision = |Relevant and Retrieved| / |Retrieved|
  – Recall = |Relevant and Retrieved| / |Relevant|
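Both metrics follow directly from set sizes. The sketch below computes them for one query; the document identifiers are made up, the function name `precision_recall` is illustrative, and returning 0.0 when a set is empty is a convention chosen for the example rather than something the slide specifies.

```python
def precision_recall(relevant: set[str], retrieved: set[str]) -> tuple[float, float]:
    """Compute (precision, recall) from the relevant and retrieved sets."""
    hits = len(relevant & retrieved)  # |Relevant and Retrieved|
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    relevant = {"d1", "d2", "d3", "d4"}   # hypothetical ground truth
    retrieved = {"d2", "d3", "d5"}        # hypothetical search result
    precision, recall = precision_recall(relevant, retrieved)
    print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the three retrieved documents are relevant, and two of the four relevant documents were retrieved. Retrieving more documents tends to raise recall at the cost of precision, so the two are normally reported together.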


IR Query Result Measures and Classification

[Figure: two panels, IR and Classification]


OLAP
• Online Analytic Processing (OLAP): provides more complex
queries than OLTP.
• OnLine Transaction Processing (OLTP): traditional
database/transaction processing.
• Dimensional data; cube view
• Visualization of operations:
• Slice: examine sub-cube.
• Dice: rotate cube to look at another dimension.
• Roll Up/Drill Down

DM: May use OLAP queries.
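Two of the cube operations are easy to show on a tiny example. The sketch below stores a hypothetical three-dimensional sales cube as a dictionary; the dimension names, the data, and the helper names `slice_cube` and `roll_up` are invented for illustration, and roll-up is shown in only one of its forms (summing a dimension away).

```python
from collections import defaultdict

# Hypothetical sales cube: (region, product, quarter) -> units sold.
DIMS = ("region", "product", "quarter")
CUBE = {
    ("East", "milk", "Q1"): 10,
    ("East", "milk", "Q2"): 12,
    ("East", "bread", "Q1"): 7,
    ("West", "milk", "Q1"): 5,
    ("West", "bread", "Q2"): 9,
}


def slice_cube(cube, dim, value):
    """Slice: keep only the cells where one dimension has a fixed value."""
    i = DIMS.index(dim)
    return {key: units for key, units in cube.items() if key[i] == value}


def roll_up(cube, dim):
    """Roll up: aggregate one dimension away by summing over it."""
    i = DIMS.index(dim)
    totals = defaultdict(int)
    for key, units in cube.items():
        totals[key[:i] + key[i + 1:]] += units
    return dict(totals)


if __name__ == "__main__":
    print(slice_cube(CUBE, "region", "West"))
    print(roll_up(CUBE, "quarter"))
```

Drill down is the reverse of roll up: it returns from the summed view to the more detailed cells.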


DM vs. Related Topics


Area      Query      Data               Results   Output
DB/OLTP   Precise    Database           Precise   DB Objects or Aggregation
IR        Precise    Documents          Vague     Documents
OLAP      Analysis   Multidimensional   Precise   DB Objects or Aggregation
DM        Vague      Preprocessed       Vague     KDD Objects


Data Mining Development


• Relational Data Model
• SQL
• Association Rule Algorithms
• Data Warehousing
• Scalability Techniques

• Similarity Measures
• Hierarchical Clustering
• IR Systems
• Imprecise Queries
• Textual Data
• Web Search Engines

• Bayes Theorem
• Regression Analysis
• EM Algorithm
• K-Means Clustering
• Time Series Analysis

• Algorithm Design Techniques
• Algorithm Analysis
• Data Structures

• Neural Networks
• Decision Tree Algorithms


KDD Issues
• Human Interaction
• Overfitting
• Outliers
• Interpretation
• Visualization
• Large Datasets
• High Dimensionality


Overfitting
• Suppose we want to predict whether an individual is short,
medium, or tall. What is wrong with this data?
Name        Gender   Height   Output
Mary        F        1.6      Short
Maggie      F        1.9      Medium
Martha      F        1.88     Medium
Stephanie   F        1.7      Short
Bob         M        1.85     Medium
Kathy       F        1.6      Short
George      M        1.7      Short
Debbie      F        1.8      Medium
Todd        M        1.95     Medium
Kim         F        1.9      Medium
Amy         F        1.8      Medium
Wynette     F        1.75     Medium
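A quick tally of the Output column points to one problem with this data, though it may not be the only one the slide has in mind: the goal names three classes, but one of them never appears among the training labels. The sketch below checks that; the rows are copied from the table, and the function name `missing_classes` is illustrative.

```python
from collections import Counter

# Rows copied from the slide: (name, gender, height, output).
TRAINING = [
    ("Mary", "F", 1.6, "Short"),
    ("Maggie", "F", 1.9, "Medium"),
    ("Martha", "F", 1.88, "Medium"),
    ("Stephanie", "F", 1.7, "Short"),
    ("Bob", "M", 1.85, "Medium"),
    ("Kathy", "F", 1.6, "Short"),
    ("George", "M", 1.7, "Short"),
    ("Debbie", "F", 1.8, "Medium"),
    ("Todd", "M", 1.95, "Medium"),
    ("Kim", "F", 1.9, "Medium"),
    ("Amy", "F", 1.8, "Medium"),
    ("Wynette", "F", 1.75, "Medium"),
]


def missing_classes(rows, expected):
    """Return the expected class labels that never occur in the rows."""
    seen = Counter(output for *_, output in rows)
    return set(expected) - set(seen)


if __name__ == "__main__":
    print(Counter(output for *_, output in TRAINING))
    print(missing_classes(TRAINING, {"Short", "Medium", "Tall"}))
```

A model fitted closely to these twelve rows has no example of the missing class to learn from, so it would label even the tallest new individual with one of the two classes it has seen. Fitting the training sample too closely in this way is the overfitting risk the slide raises.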


KDD Issues (cont’d)


• Multimedia Data
• Missing Data
• Irrelevant Data
• Noisy Data
• Changing Data
• Integration
• Application


WARNING
• With data mining you don’t always know what you
are looking for.
• There is not one right answer.
• The data you are using is noisy.
• Data Mining is a very applied discipline.
• A data mining course provides you tools to use to
analyze data.
• Experience provides you knowledge of how to use
these tools.


http://ieeexplore.ieee.org/iel5/6/32236/01502526.pdf?tp=&arnumber=1502526&isnumber=32236


Social Implications of DM
• Privacy
• Profiling
• Unauthorized use
• Invalid results and claims


Data Mining Metrics


• Usefulness
• Return on Investment (ROI)
• Accuracy
•…
• Space/Time


Visualization Techniques
• Graphical
• Geometric
• Icon-based
• Pixel-based
• Hierarchical
• Hybrid


Models Based on Summarization


• Visualization: Frequency distribution, mean,
variance, median, mode, etc.
• Box Plot:
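The summary statistics listed here, and the five numbers a box plot is drawn from, can be computed with Python's standard `statistics` module. The sketch reuses the twelve heights from the overfitting table earlier in the lecture as sample data; the function name `five_number_summary` and the choice of the "inclusive" quartile method are assumptions, and other quartile conventions give slightly different box edges.

```python
import statistics

# Heights taken from the overfitting example earlier in the lecture.
HEIGHTS = [1.6, 1.9, 1.88, 1.7, 1.85, 1.6, 1.7, 1.8, 1.95, 1.9, 1.8, 1.75]


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the values a box plot displays."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


def summarize(values):
    """Collect the summary statistics named on the slide."""
    return {
        "mean": statistics.mean(values),
        "variance": statistics.variance(values),  # sample variance
        "median": statistics.median(values),
        "modes": statistics.multimode(values),    # all most-frequent values
    }


if __name__ == "__main__":
    print(summarize(HEIGHTS))
    print(five_number_summary(HEIGHTS))
```

A summary like this is itself a simple descriptive model of the data: it replaces the individual values with a handful of numbers that can be compared across samples.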


DM Tools
• XLMiner – Easy add-in to Excel
http://www.solver.com/xlminer/index.html
• Weka – Open Source; Visualization, Functionality, Interface
http://www.cs.waikato.ac.nz/ml/weka/
• SAS (JMP) – Commercial Product
• SPSS – Commercial Product
• MATLAB – Statistical/Math Applications
• R – Programming


Thank you for your attention!
