Module 1
By,
Dr. B.Prabadevi
Associate Professor, SITE
Vellore Institute of Technology
SJT-111-A15
Contact: [email protected]
+91-9442690043
MODULE 1: DATA MINING CONCEPT
• Introduction to Data Mining – Data Mining Functionalities –
Classification of Data Mining Systems, Data Mining Task Primitives-
Integration of Data Mining With Database- Major Issues in Data
Mining.
Figure: Data mining as a confluence of multiple disciplines (Machine Learning, Visualization, Pattern Recognition, Algorithm, Other Disciplines)
• Improving health care and reducing costs
• Predicting the impact of climate change
• Sequential Data: transaction sequences
• Genetic sequence data
• Spatial, image and multimedia:
• Spatial data: maps
• Image data:
• Video data:
Document data (term counts per document):
             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1      3      0     5     0      2     6    0     2        0       2
Document 2      0      7     0     2      1     0    0     3        0       0
Document 3      0      1     0     0      1     2    2     0        3       0
12/12/2023 SWE2009-Data Mining Techniques 18
Data collection sources
• Repositories:
• https://www.kaggle.com/discussions/general/210203
• Google Dataset Search
• Datasetlist
• UCI
• ImageNet
• CT Medical Images
• https://www.kdnuggets.com/datasets/index.html
Objects
an object
• Examples: eye color of a person, temperature, customer_ID, name, address
• Attribute is also known as variable, field, characteristic, dimension, or feature
• A collection of attributes describes an object
• Object is also known as record, point, case, sample, entity, or instance
Tid  Refund  Marital Status  Taxable Income  Cheat
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
• Data preparation, cleaning, and transformation make up the majority of the work in a data
mining application (roughly 90%).
Sample dataset
CustomerID  Genre   Age  Annual Income (k$)  Spending Score (1-100)
1           Male    19   15                  39
2           Male    21   15                  81
3           Female  20   16                  6
4           Female  23   16                  77
5           Female  31   17                  40
6           Female  22   17                  76
7           Female  35   18                  6
8           Female  23   18                  94
9           Male    64   19                  3
10          Female  30   19                  72
11          Male    67   19                  14
12          Female  35   19                  99
13          Female  58   20                  15
14          Female  24   20                  77
15          Male    37   20                  13
16          Male    22   20                  79
17          Female  35   21                  35
18          Male    20   21                  66
19          Male    52   23                  29
20          Female  35   23                  98
21          Male    35   24                  35
22          Male    25   24                  73
23          Female  46   25                  5
24          Male    31   25                  73
25          Female  54   28                  14
26          Male    29   28                  82
27          Female  45   28                  32
28          Male    35   28                  61
29          Female  40   29                  31
30          Female  23   29                  87
Slide callouts: Independent variable (input columns); Dependent variable (Spending Score (1-100))
Independent variable vs Dependent variable
• The independent variable is the
cause.
• Its value is independent of other
variables.
• The dependent variable is the effect.
• Its value depends on changes in the
independent variable.
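A minimal sketch of how this distinction shows up when preparing data, using five rows (customers 3 to 7) from the sample dataset. Treating Age and Annual Income as the independent variables and Spending Score as the dependent variable is an assumption made for illustration; the names `rows`, `X`, and `y` are hypothetical.

```python
# Rows from the sample dataset (customers 3 to 7):
# (Age, Annual Income (k$), Spending Score (1-100))
rows = [
    (20, 16, 6),
    (23, 16, 77),
    (31, 17, 40),
    (22, 17, 76),
    (35, 18, 6),
]

# Assumption: Age and Annual Income are the independent variables (inputs).
X = [(age, income) for age, income, _ in rows]

# Assumption: Spending Score is the dependent variable (the value to explain).
y = [score for _, _, score in rows]

print(X)
print(y)
```

A predictive model would then be fitted to explain `y` from `X`.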
Data:
11, 13, 13, 15, 15, 16, 19, 20, 20, 20, 21, 21, 22, 23, 24, 30, 40, 45, 45, 45, 71, 72, 73, 75
Partition the data into 4 bins using both the Equi-Depth and the Equi-Width binning
methods, and then apply the following to each partitioning:
a) Smoothing by bin mean
b) Smoothing by bin median
c) Smoothing by bin boundaries
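The exercise above can be sketched in Python as follows. This is one possible approach, not a prescribed solution: the function names are hypothetical, equi-width bins are assumed to have width (max - min) / k with the maximum value placed in the last bin, and in boundary smoothing a value equally close to both boundaries is assumed to go to the lower one.

```python
from statistics import mean, median

data = [11, 13, 13, 15, 15, 16, 19, 20, 20, 20, 21, 21,
        22, 23, 24, 30, 40, 45, 45, 45, 71, 72, 73, 75]

def equi_depth_bins(values, k):
    """Split the sorted values into k bins with the same number of values."""
    values = sorted(values)
    size = len(values) // k  # assumes len(values) is divisible by k
    return [values[i * size:(i + 1) * size] for i in range(k)]

def equi_width_bins(values, k):
    """Split the value range [min, max] into k intervals of equal width."""
    values = sorted(values)
    lo, hi = values[0], values[-1]
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in values:
        idx = min(int((v - lo) / width), k - 1)  # max value goes to last bin
        bins[idx].append(v)
    return bins

def smooth_by_mean(bins):
    """Replace every value with the mean of its bin (empty bins skipped)."""
    return [[round(mean(b), 2)] * len(b) for b in bins if b]

def smooth_by_median(bins):
    """Replace every value with the median of its bin (empty bins skipped)."""
    return [[median(b)] * len(b) for b in bins if b]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's min and max."""
    result = []
    for b in bins:
        if not b:
            continue
        lo, hi = b[0], b[-1]  # bins are sorted, so these are the boundaries
        result.append([lo if v - lo <= hi - v else hi for v in b])
    return result

for name, bins in [("equi-depth", equi_depth_bins(data, 4)),
                   ("equi-width", equi_width_bins(data, 4))]:
    print(name, "bins:      ", bins)
    print(name, "mean:      ", smooth_by_mean(bins))
    print(name, "median:    ", smooth_by_median(bins))
    print(name, "boundaries:", smooth_by_boundaries(bins))
```

With 24 values and 4 bins, equi-depth binning puts 6 values in each bin, while equi-width binning uses intervals of width 16 starting at 11, so the bins hold different numbers of values.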
Figure: Data mining tasks applied to the data: Clustering, Predictive Modeling, Anomaly Detection, Association Rules
Data:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
11   No      Married         60K             No
13   No      Single          85K             Yes
14   No      Married         75K             No
15   No      Single          90K             Yes
• “What makes a pattern interesting? Can a data mining system generate all of
the interesting patterns? Can a data mining system generate only interesting
patterns?”
• A pattern is interesting if it is
• easily understood by humans,
• valid on new or test data with some degree of certainty,
• potentially useful, and
• novel.
• Measures:
• support(X ⇒ Y) = P(X ∪ Y), the probability that a transaction contains both X and Y
• confidence(X ⇒ Y) = P(Y | X), the probability that a transaction containing X also contains Y
• Can a data mining system generate all of the interesting patterns?
• This question concerns the completeness of the data mining algorithm.
• It is often unrealistic and inefficient for data mining systems to generate all of the possible
patterns.
• Can a data mining system generate only interesting patterns?
• This is an optimization problem in data mining.
• It is highly desirable for data mining systems to generate only interesting patterns.
• Doing so is more efficient for both users and DM systems, since it reduces the search for
interesting patterns.
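The support and confidence measures can be illustrated with a short Python sketch. The transactions below are made up for illustration, and the function names are hypothetical. Confidence is computed from raw counts, which is equivalent to support(X ∪ Y) / support(X).

```python
def count(transactions, itemset):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

def support(transactions, x, y):
    """support(X => Y): fraction of transactions containing both X and Y."""
    return count(transactions, set(x) | set(y)) / len(transactions)

def confidence(transactions, x, y):
    """confidence(X => Y): fraction of transactions with X that also have Y."""
    return count(transactions, set(x) | set(y)) / count(transactions, x)

# Hypothetical market-basket data (not from the slides)
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]

print(support(transactions, {"bread"}, {"milk"}))     # 3 of 5 transactions -> 0.6
print(confidence(transactions, {"bread"}, {"milk"}))  # 3 of 4 bread baskets -> 0.75
```

A rule such as bread ⇒ milk would typically be reported only if both values exceed user-chosen minimum thresholds.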
Interestingness measures
• Measures are intended for selecting and ranking patterns according to their potential
interest to the user
• Good measures also allow the time and space costs of the mining process to be reduced
• Nine specific criteria are used to determine whether or not a pattern is interesting:
1. conciseness,
2. coverage,
3. reliability,
4. peculiarity,
5. diversity,
6. novelty,
7. surprisingness,
8. utility, and
9. actionability
Objective vs. subjective interestingness measures
Objective: based on statistics and structures of patterns,
e.g., support, confidence, etc.
Subjective: based on user’s belief in the data, e.g.,
unexpectedness, novelty, actionability, etc.
3. Reliability: A pattern is reliable if the relationship described by the pattern occurs in a high percentage
of applicable cases, e.g., a highly accurate classification rule or a high-confidence association rule.
• Example: Bread ⇒ Milk occurring in all feasible transactions in all applications
4. Peculiarity: A pattern is peculiar if it is far away from other discovered patterns according to some
distance measure.
• Peculiar patterns are generated from peculiar data (or outliers), which are relatively few in number and
significantly different from the rest of the data
• Peculiar patterns may be unknown to the user
6. Novelty: A pattern is novel to a person if he or she did not know it before and is not able to infer it
from other known patterns.
• A novel pattern is new and not contradicted by any pattern already known to the user
• 1970s:
• Relational data model, relational DBMS implementation
• 1980s:
• RDBMS, advanced data models (extended-relational, OO,
deductive, etc.)
• Application-oriented DBMS (spatial, scientific, engineering,
etc.)
• 1990s:
• Data mining, data warehousing, multimedia databases, and
Web databases
• 2000s:
• Stream data management and mining
• Data mining and its applications
• Web technology (XML, data integration) and global
information systems
Data Mining: Classification Schemes
• General functionality
• Descriptive data mining
• Predictive data mining
• what happened?
• where exactly is the problem?
• Requires: data aggregation and data mining; statistics and forecasting methods
• Analysis includes
• Time-series data analysis
• Sequence or periodicity pattern matching
• Similarity-based data analysis.
Data Mining Function: (6) Time and Ordering:
Sequential Pattern, Trend and Evolution Analysis
• Sequence, trend and evolution analysis
• Trend, time-series, and deviation analysis: e.g., regression and value
prediction
• Sequential pattern mining
• e.g., first buy digital camera, then buy large SD memory cards
• Periodicity analysis
• Motifs and biological sequence analysis
• Approximate and consecutive motifs (a motif is a repeated element that forms a
pattern, here a recurring subsequence)
• Similarity-based analysis
• Mining data streams
• Ordered, time-varying, potentially infinite, data streams
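As a rough illustration of sequential pattern mining, the sketch below measures how often the "digital camera, then SD memory card" pattern occurs in order within customer purchase histories. The purchase data and function names are hypothetical, and gaps between the two purchases are assumed to be allowed.

```python
def contains_in_order(sequence, pattern):
    """True if the pattern's items appear in the sequence in the same order.

    Other items may occur in between (gaps are allowed).
    """
    remaining = iter(sequence)
    # "item in remaining" advances the iterator past the match, so each
    # later pattern item must be found after the previous one.
    return all(item in remaining for item in pattern)

def sequence_support(sequences, pattern):
    """Fraction of sequences that contain the pattern in order."""
    matches = sum(contains_in_order(s, pattern) for s in sequences)
    return matches / len(sequences)

# Hypothetical purchase histories, one list per customer, in time order
purchases = [
    ["digital camera", "tripod", "SD memory card"],
    ["SD memory card", "digital camera"],
    ["digital camera", "SD memory card"],
    ["laptop", "mouse"],
]

pattern = ["digital camera", "SD memory card"]
print(sequence_support(purchases, pattern))  # 2 of 4 customers -> 0.5
```

The second customer bought both items but in the reverse order, so that history does not count toward the pattern's support.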
Pop-Quiz
• OLAP is_____
• OLTP is_____
• OLAP is used for _____
• Prediction is done using____
• Clustering uses ____ data
• Classification and Regression use _____ data
• Which is supervised learning?
• Supervised learning applications are:
• Unsupervised learning applications:
• Identify whether the product A($2, “SUPERSONIC”, Disney, Toy, 5yrs+) will attract more
customers in the upcoming sale?
• Determine the revenue that each product on Amazon will generate in an upcoming sale.
• Mining Methodology
• Mining various and new kinds of knowledge
• Mining knowledge in multi-dimensional space
• Mining different kinds of knowledge in databases:
Steps in Data Mining-Project Perspective
1. Develop an understanding of the purpose of the data mining project.
How will the stakeholder use the results? Who will be affected by the
results? Will the analysis be a one-shot effort or an ongoing procedure?
2. Obtain the dataset to be used in the analysis.
• This often involves sampling from a large database to capture records to be used
in an analysis. It may also involve pulling together data from different databases
or sources.
• The databases could be internal (e.g., past purchases made by customers) or
external (credit ratings). While data mining deals with very large databases,
usually the analysis to be done requires only thousands or tens of thousands of
records.