Module 1

This document provides an overview of the SWE2009 Data Mining Techniques course taught by Dr. B. Prabadevi at Vellore Institute of Technology. The course covers data mining concepts including an introduction to data mining, the data mining process, and applications of data mining techniques. It is a 3 credit course offered in the fourth semester with 4 hours of lectures per week. Contact information is provided for the course instructor.

SWE2009 - Data Mining Techniques


L T P J C
3 0 0 4 4

By,
Dr. B.Prabadevi
Associate Professor, SITE
Vellore Institute of Technology
SJT-111-A15
Contact: [email protected]
+91-9442690043
MODULE 1: DATA MINING CONCEPT
• Introduction to Data Mining – Data Mining Functionalities –
Classification of Data Mining Systems, Data Mining Task Primitives-
Integration of Data Mining With Database- Major Issues in Data
Mining.

12/12/2023 SWE2009-Data Mining Techniques 2


What is Data Mining?
• Data information that has been translated into a form that is
efficient for processing
• Mining Extraction of valuable information
• Data +Mining  Extracting valuable data from large datasets

Data Mining

12/12/2023 SWE2009-Data Mining Techniques 3


What is Data Mining?
• Knowledge discovery from data (KDD)
• Extract usable information from a larger set of raw data
• DM is the process of sorting through large data sets to identify patterns and
relationships that can help solve real-world business problems through data analysis.
• The goal is to find patterns, discover trends, and gain insight into how the data can be used.
• Examples:
• Customer purchasing pattern analysis
• Student performance analysis
• Patient disease prediction, disease classification
• Misuse detection
• Customer churn prediction
• Credit card fraud detection



Formal definition
• Data mining is the process of discovering
interesting patterns and knowledge from
large amounts of data.
• The data sources can include databases,
data warehouses, the Web, other
information repositories, or data that are
streamed into the system dynamically.


KDD: A Definition
KDD is the automatic or semi-automatic
extraction of non-obvious, hidden knowledge from large
volumes of data.

• 10^18 to 10^24 bytes: we never see the whole data set, so we put it in the memory of computers
• Then run data mining algorithms
• What is the knowledge? How to represent and use it?
Data, Information, Knowledge
We often see data as a string of bits, or numbers
and symbols, or “objects” which we collect daily.

Information is data stripped of redundancy, and
reduced to the minimum necessary to characterize
the data.

Knowledge is integrated information, including facts
and their relations, which have been perceived,
discovered, or learned as our “mental pictures”.
Knowledge can be considered data at
a high level of abstraction and generalization.
Why Data Mining?

• There has been enormous data growth in both commercial and
scientific databases due to advances in data generation and
collection technologies

[Figure panels: E-Commerce, Cyber Security, Traffic Patterns, Social Networking (Twitter), Sensor Networks, Computational Simulations]


Why Data Mining?

The world is data rich but information poor.

• We live in the Information Age!
• Zettabytes of data (1024 ZB = 1 YB) accumulate in our
computer networks through the WWW, storage pools,
businesses, smart gadgets and so on.
• Telecommunications and Web searches carry enormous volumes
of data traffic every day, much of it generated by so-called netizens through online social networks (OSN)
• The medical industry generates tremendous amounts of data
• Businesses like Dmart and Amazon generate gigantic sales data
• Data drives us: to live well in this futuristic meta
world, analysing data makes a lot of sense.
Why Data Mining? Commercial Viewpoint

• Lots of data is being collected and warehoused
• Web data
• Google holds many petabytes of web data
• Facebook has billions of active users
• Purchases at department/grocery stores, e-commerce
• Amazon handles millions of visits/day
• Bank/Credit Card transactions
• Computers have become cheaper and more powerful
• Competitive pressure is strong
• Provide better, customized services for an edge (e.g. in Customer Relationship Management)


Why Data Mining? Scientific Viewpoint
• Data collected and stored at enormous speeds
• remote sensors on a satellite
• NASA EOSDIS archives petabytes of earth science data per year
• telescopes scanning the skies
• Sky survey data
• High-throughput biological data
• scientific simulations
• terabytes of data generated in a few hours
• Data mining helps scientists
• in automated analysis of massive datasets
• in hypothesis formation

[Figure panels: fMRI Data from Brain, Sky Survey Data, Gene Expression Data, Surface Temperature of Earth]
Data mining turns a large collection of data into knowledge
DM: searching for knowledge (interesting patterns) in data.
• Example: Google Search engine
• Consider each user's search a transaction
• Example of users searching for Apple:
• It may list the Mac (the Apple computer) or the fruit apple
• Search queries disclose invaluable information such as:
• User search pattern
• Location based searches
• Age based trend details
• Category based information
• Benefit:
• This provides insight for decision makers, content creators and other businesses
Data Mining: Confluence of Multiple Disciplines

[Figure: Data Mining at the centre, drawing on Database Technology, Statistics, Machine Learning, Visualization, Pattern Recognition, Algorithm, and Other Disciplines]


Great Opportunities to Solve Society’s Major Problems
• Improving health care and reducing costs
• Predicting the impact of climate change
• Reducing hunger and poverty by increasing agriculture production
• Finding alternative/green energy sources
Data Mining Process
• The main aim of the data mining process is to extract information
from a data set and translate it into an understandable structure to
be used in the future
• The data mining process is split into two parts: Data Preprocessing
and Mining.

Data mining process


[Figure: Steps in Data Mining Process]


Steps in Data mining Process
• Iterative steps in Data mining
• Data cleaning (to remove noise and inconsistent data)
• Data integration (where multiple data sources may be combined)
• Data selection (where data relevant to the analysis task are retrieved from the database)
• Data transformation (where data are transformed and consolidated into forms appropriate
for mining by performing summary or aggregation operations)
• Data mining (an essential process where intelligent methods are applied to extract data
patterns)
• Pattern evaluation (to identify the truly interesting patterns representing knowledge based
on interestingness measures)
• Knowledge representation (where visualization and knowledge representation
techniques are used to present mined knowledge to users)



Types of Data Sets
• Record
  • Relational records
  • Data matrix, e.g., numerical matrix, crosstabs
  • Document data: text documents
  • Transaction data
• Graph and network
  • World Wide Web
  • Social or information networks
  • Molecular structures
• Ordered
  • Video data: sequence of images
  • Temporal data: time-series
  • Sequential data: transaction sequences
  • Genetic sequence data
• Spatial, image and multimedia
  • Spatial data: maps
  • Image data
  • Video data

Transaction data example:
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Document data example (term counts per document):
            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0
Data collection sources
• Repositories:
• https://www.kaggle.com/discussions/general/210203
• Google Dataset Search
• Datasetlist
• UCI
• ImageNet
• CT Medical Images
• https://www.kdnuggets.com/datasets/index.html


Types of data
• What is Data?
• Collection of data objects and their attributes
• An attribute is a property or characteristic of an object
• Examples: eye color of a person, temperature, customer_ID, name, address
• Attribute is also known as variable, field, characteristic, dimension, or feature
• A collection of attributes describes an object
• Object is also known as record, point, case, sample, entity, or instance

Example (columns are attributes, rows are objects):
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes


Data: Attribute value
• Attribute Values: are numbers or symbols assigned to an attribute for
a particular object
• Distinction between attributes and attribute values
• Same attribute can be mapped to different attribute values
• Example: height can be measured in feet or meters
• Different attributes can be mapped to the same set of values
• Example: Attribute values for ID and age are integers
• But properties of attribute can be different than the properties of the
values used to represent the attribute



Data: Types of Attributes
• Qualitative or Categorical data describes categories or groups.
• Ex: car brands like Mercedes, BMW and Audi
• Answers to yes and no questions.
• Quantitative data
• represents numbers.
• It is further divided into subsets: Numeric, discrete and continuous.
• Numeric: Numbers, measurable quantity, represented in integer or real values
• Numerical attributes are of 2 types: interval and ratio.
• Discrete data can usually be counted in a finite manner.
• Ex: number of chocolates you want to have; grades at university are discrete
• Continuous data can take any value within a range, so the possible values cannot be counted.
• Ex: Weight, height
Data: Types of Attributes
• There are different types of attributes
• Nominal
• Examples: ID numbers, eye color, zip codes
• Binary
• Nominal attribute with only 2 states (0 and 1)
• Symmetric binary: both outcomes equally important
• e.g., gender
• Asymmetric binary: outcomes not equally important.
• e.g., medical test (positive vs. negative)
• Convention: assign 1 to most important outcome (e.g., disease)
• Ordinal
• Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height {tall,
medium, short}
• Interval
• Examples: calendar dates, temperatures in Celsius or Fahrenheit.
• Ratio
• Examples: temperature in Kelvin, length, counts, elapsed time (e.g., time to run a race)
Categorical Attribute Types
• Nominal: Limited number of different values or list of possible values that the
variable may take, with no order or hierarchy. For example, a variable Industry takes values such as financial,
engineering, retail, etc.
• Hair_color = {auburn, black, blond, brown, grey, red, white}
• marital status, occupation, ID numbers, zip codes, customer satisfaction surveys, pizza toppings
• Binary
• Nominal attribute with only 2 states (0 and 1)
• Symmetric binary: both outcomes equally important
• e.g., gender
• Asymmetric binary: outcomes not equally important.
• e.g., medical test (positive vs. negative)
• Convention: assign 1 to most important outcome (e.g., disease)



Categorical Attribute Types
• Ordinal
• Values have a meaningful order (ranking), but the magnitude between
successive values is not known. The values follow some order.
• Size = {small, medium, large}
• Grades = {S, A, B, C, D, E, F, N1, N2}
• Army rankings = {Second Lieutenant, Lieutenant, Captain, Major, Lieutenant Colonel, Colonel, Brigadier, …}


Quantitative Attribute Types
• Numeric (integer or real-valued): a measurable quantity,
represented in integer or real values.
• Interval: measured in fixed and equal units. Interval scales do not
have a natural zero.
• Scales that describe values where the interval between the values has meaning and whose
differences are interpretable; variables are measured on an equidistant scale and 0 is arbitrary
• Temperatures of the day (Celsius and Fahrenheit)
• If the temperature reading on one day is twice that of another day, we cannot say that the first day is twice as
hot as the other.
• Calendar dates: the years 2014 and 2022 are eight years apart
• Times of the day (1pm, 2pm, 3pm, 4pm, etc.)
• IQ scores (100, 110, 120, 130, 140, etc.)
• Credit ratings (20, 40, 60, 80, 100)
• Dates (1740, 1840, 1940, 2040, 2140, etc.)


Quantitative Attribute Types
• Ratio: fixed zero point, no negative values
• Used to compare both the differences and the ratios between values
• Example: a ratio scale of bank account balances whose possible values are $5,
$10 and $15. The difference between each successive pair is $5, and $10 is twice
as much as $5. Since ratios of values are meaningful, the scale is defined as
having a natural zero.
• Kelvin (K) temperature scale, count attributes such as years of
experience of an employee, number of words in a document
• Ratio data can be added, subtracted, multiplied or divided, but
interval data can only be added or subtracted
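
The difference between interval and ratio scales can be checked with a small sketch. The two temperature readings (10 and 20 degrees Celsius) and the helper name `celsius_to_kelvin` are illustrative assumptions, not values from the slides.

```python
def celsius_to_kelvin(celsius: float) -> float:
    """Convert an interval-scale Celsius reading to the ratio-scale Kelvin value."""
    return celsius + 273.15


# Two hypothetical daily temperatures in Celsius (illustrative values).
day_a_c, day_b_c = 10.0, 20.0

# Differences are meaningful on both scales and agree with each other.
diff_c = day_b_c - day_a_c
diff_k = celsius_to_kelvin(day_b_c) - celsius_to_kelvin(day_a_c)

# Ratios are only meaningful on the ratio scale, because Kelvin has a true zero.
naive_ratio = day_b_c / day_a_c
true_ratio = celsius_to_kelvin(day_b_c) / celsius_to_kelvin(day_a_c)

print(f"difference: {diff_c:.1f} C, {diff_k:.1f} K")
print(f"Celsius ratio (misleading): {naive_ratio:.2f}")
print(f"Kelvin ratio (meaningful): {true_ratio:.3f}")
```

The Celsius ratio suggests "twice as hot", while the Kelvin ratio shows the second day is only a few percent warmer in absolute terms.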



Quantitative Attribute Types
• Discrete:
• Discrete data are a type of quantitative data that can take only fixed values.
• They are always numerical.
• These are data that can be counted, but not measured.
• Ex: in a household survey, the number of individuals living under one roof can be 1, 2,
3, ... but not 3.4
• They are not always whole numbers and can be relatively imprecise; a fixed decimal value also counts, e.g., a shoe size
of 7.5 exists but 7.7 does not
• Days in the month with a temperature measuring above 30 degrees
• Examples:
• The size of your department’s workforce.
• How many new clients you brought on board in the previous quarter.
• How many items are currently kept in stock.
Quantitative Attribute Types
• Continuous:
• continuous data are not limited in the number of values they can take, placed
on an infinite number line
• Can potentially be measured with an ever-increasing degree of precision.
• The volume of a gas tank in liters
• Wind speed in miles per hour
• The height of buildings in meters
• Length of a rope in inches
• How much time is required to finish an activity or project

Summary: attribute type, hint, examples
• Nominal
  • Hint: group of data with labels, no ordering
  • Examples: Pizza_toppings = {Spinach, pepperoni, olives, tomatoes, cheese}; Living_states = {TN, AP, UP, MP, Kerala}
• Binary
  • Hint: takes two values, 0 or 1
  • Examples: Symmetric: Gender = {0-Male, 1-Female}, equal importance. Asymmetric: Diabetic = {0-No, 1-Yes}, unequal importance, 1 is more important than 0. Assign 1 to the more important state.
• Ordinal
  • Hint: group of data with labels, ordered
  • Examples: Likert scale for importance of telemedicine: {1-very important, 2-important, 3-neutral, 4-unimportant, 5-very unimportant}
• Interval
  • Hint: integer, equidistant, differences are interpretable, no absolute zero
  • Examples: IQ score (no zero point); temperature in Fahrenheit and Celsius
• Ratio
  • Hint: equidistant, non-negative, absolute zero, so differences and ratios are comparable
  • Examples: money in four accounts {20, 40, 60, 80}; weight; height {less than 5 ft, 5.1 ft to 5.5 ft, 5.6 ft to 6 ft, >6 ft}; Kelvin scale
• Discrete
  • Hint: counted and has a limited number of values
  • Examples: shoe sizes; number of clients in the last year
• Continuous
  • Hint: unlimited number of different values
  • Examples: daily temperature in your city; an individual’s weight


Properties of attribute values



Pop Quiz
• What is the other name of Data mining?
• Determine the attribute type:
1. A list of a baseball team’s seasonal wins
2. Number of different vegetables in a crate
3. The weight of a crate of vegetables in kilograms
4. Credit ratings (20, 40, 60, 80, 100)
5. How much time do you spend on social media per day? Possible
answers: 0-1 hours, 1-2 hours, 2-3 hours, 3-4 hours, 4-5 hours.
6. What is today’s temperature in Celsius? Answers: -10, 0, +10, +20, +30.
7. How old are you? Answers: 18-24 years old, 25-34 years old
8. Economic status (poor, middle income, wealthy)
9. Are you over 30 years of age? Possible answers: Yes, no
10. Age (child, teenager, young adult, middle-aged, retiree)
11. Personality type (introvert, extrovert, ambivert)
12. Likert scales (Very satisfied, satisfied, neutral, dissatisfied, very
dissatisfied)
13. Blood type (O negative, O positive, A negative, and so on)
14. State your annual income (20000 to 40000, 40000 to 60000, 60000 to 80000)

Answers
• Other name of data mining: KDD
• Attribute types: 1. Discrete, 2. Discrete, 3. Continuous, 4. Interval, 5. Ratio, 6. Interval, 7. Ratio,
8. Ordinal, 9. Binary, 10. Ordinal, 11. Nominal, 12. Ordinal, 13. Nominal, 14. Interval


Why Data Preprocessing?
• Data in the real world is dirty
• incomplete: missing attribute values, lack of certain attributes of interest, or containing only aggregate data
• e.g., occupation=“”
• noisy: containing errors or outliers
• e.g., Salary=“-10”
• inconsistent: containing discrepancies in codes or names
• e.g., Age=“42” Birthday=“03/07/1997”
• e.g., Was rating “1,2,3”, now rating “A, B, C”
• e.g., discrepancy between duplicate records

• No quality data, no quality mining results!
• Quality decisions must be based on quality data
• e.g., duplicate or missing data may cause incorrect or even misleading statistics.
• Data preparation, cleaning, and transformation comprise the majority of the work in a data
mining application (often quoted as about 90%).
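
The three kinds of dirty data above (incomplete, noisy, inconsistent) can be flagged with simple checks. The sketch below is illustrative only: the records, the field names, the reading of 03/07/1997 as 3 July 1997, and the reference date are all assumptions.

```python
from datetime import date

# Hypothetical records mirroring the examples on the slide.
records = [
    {"id": 1, "occupation": "", "salary": 52000, "age": 42, "birthday": date(1997, 7, 3)},
    {"id": 2, "occupation": "clerk", "salary": -10, "age": 30, "birthday": date(1993, 5, 20)},
    {"id": 3, "occupation": "nurse", "salary": 41000, "age": 28, "birthday": date(1995, 1, 15)},
]

# Assumed "today" for the age check (the date printed on the slides).
REFERENCE_DATE = date(2023, 12, 12)


def age_on(birthday: date, today: date) -> int:
    """Age in completed years on the given day."""
    years = today.year - birthday.year
    if (today.month, today.day) < (birthday.month, birthday.day):
        years -= 1
    return years


def find_issues(record: dict, today: date = REFERENCE_DATE) -> list:
    """Return a list of data-quality problems found in one record."""
    issues = []
    if not record["occupation"]:
        issues.append("incomplete: occupation missing")
    if record["salary"] < 0:
        issues.append("noisy: negative salary")
    if age_on(record["birthday"], today) != record["age"]:
        issues.append("inconsistent: age does not match birthday")
    return issues


for record in records:
    print(record["id"], find_issues(record))
```

Checks like these only detect problems; deciding how to repair them is the job of the data cleaning step.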
Sample dataset
(Callouts on the slide: Independent variable, Dependent variable)

CustomerID  Genre   Age  Annual Income (k$)  Spending Score (1-100)
1           Male    19   15                  39
2           Male    21   15                  81
3           Female  20   16                  6
4           Female  23   16                  77
5           Female  31   17                  40
6           Female  22   17                  76
7           Female  35   18                  6
8           Female  23   18                  94
9           Male    64   19                  3
10          Female  30   19                  72
11          Male    67   19                  14
12          Female  35   19                  99
13          Female  58   20                  15
14          Female  24   20                  77
15          Male    37   20                  13
16          Male    22   20                  79
17          Female  35   21                  35
18          Male    20   21                  66
19          Male    52   23                  29
20          Female  35   23                  98
21          Male    35   24                  35
22          Male    25   24                  73
23          Female  46   25                  5
24          Male    31   25                  73
25          Female  54   28                  14
26          Male    29   28                  82
27          Female  45   28                  32
28          Male    35   28                  61
29          Female  40   29                  31
30          Female  23   29                  87
Independent variable vs Dependent
variable
• The independent variable is the
cause.
• Its value is independent of other
variables.
• The dependent variable is the effect.
• Its value depends on changes in the
independent variable.



Data Preprocessing tasks
• Data cleaning
• Fill in missing values, smooth noisy data, identify or
remove outliers and noisy data, and resolve
inconsistencies
• Data integration
• Integration of multiple databases, or files
• Data transformation
• Normalization and aggregation, Smoothing
• Data reduction
• Obtains reduced representation in volume but produces
the same or similar analytical results
• Data discretization (for numerical data)
• a method that converts the attribute values of continuous data
into a discrete collection of intervals while minimizing the
amount of data that is lost in the process
Data Mining Process: Data cleaning
• Importance
• “Data cleaning is one of the three biggest problems in data warehousing”
—Ralph Kimball
• “Data cleaning is the number one problem in data warehousing”
—DCI survey



Data Mining Process: Data cleaning

Fill in missing values:
• Ignore the tuple
• Fill in the missing value manually
• Use a global constant to fill in the missing value
• Use the attribute mean to fill in the missing value
• Use the attribute mean for all samples belonging to the same class as the given tuple
• Use the most probable value to fill in the missing value

Noisy data:
• Binning methods: smoothing the data by consulting the neighbourhood
• Clustering: identify outliers and smooth out noisy data
• Combined computer and human inspection
• Regression: finding the best fit for the data based on the relationship among variables or features
Sample Dataset
(Blank cells are missing values.)
Name     class  Lectures   Grades  Credits  Retake
Ram      A      Maths      90      6        No
Ram      A      Chemistry  54      6        No
Ram      A      Physics    77      6        No
Ram      A      History    22      5        Yes
Ram      A      Geography  25      5        Yes
Ram      A      German     70      4        No
Shyam    B      Maths      90      6        No
Shyam    B      Chemistry  90      6        No
Shyam    B                 90      6        No
Shyam    B      History    45      5        No
Shyam           Geography  90      5        No
Shyam    B      German     90      4        No
Sheetal  C      Maths      70      6        No
Sheetal  C      Chemistry  44      6
Sheetal  C      Physics    56      6        No
Sheetal  C      History    77      5        No
Sheetal  C      Geography  35      5        Yes
Sheetal  C      German     99      4        No
Shylu    D      Maths      55      6        No
Shylu    D      Chemistry  67      6        No
Shylu    D      Physics            6        No
Shylu    D      History    90      5        No
Shylu    D      Geography  45      5        No
Shylu    D      German     67      4        No
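
One of the strategies for missing values, using the attribute mean of all samples in the same class, can be sketched on a few rows of this dataset. This is a minimal sketch: the tuple layout and the function name `impute_class_mean` are assumptions, and only the missing Physics grade for Shylu is filled.

```python
from statistics import mean

# (name, class label, lecture, grade); None marks a missing grade.
rows = [
    ("Shylu", "D", "Maths", 55),
    ("Shylu", "D", "Chemistry", 67),
    ("Shylu", "D", "Physics", None),
    ("Shylu", "D", "History", 90),
    ("Shylu", "D", "Geography", 45),
    ("Shylu", "D", "German", 67),
    ("Ram", "A", "Maths", 90),
    ("Ram", "A", "Physics", 77),
]


def impute_class_mean(rows):
    """Replace each missing grade with the mean grade of the same class label."""
    known = {}
    for _, label, _, grade in rows:
        if grade is not None:
            known.setdefault(label, []).append(grade)
    class_means = {label: mean(grades) for label, grades in known.items()}
    return [
        (name, label, lecture, class_means[label] if grade is None else grade)
        for name, label, lecture, grade in rows
    ]


filled = impute_class_mean(rows)
print(filled[2])
```

For class D the known grades are 55, 67, 90, 45 and 67, so the missing grade is filled with their mean, 64.8. Using the global mean over every class instead would pull the value toward the grades of unrelated students.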



Data Cleaning Steps



Data Transformation
• The data are transformed or consolidated into forms appropriate
for mining
• Smoothing: Removing noise from data using binning, clustering,
regression techniques, etc.
• Clustering for outlier analysis
• Aggregation: Summary operations are applied to data.
• For example, the daily sales data may be aggregated so as to compute
monthly and annual total amounts.
• Generalization of the data, where low level or 'primitive' (raw)
data are replaced by higher level concepts through the use of
concept hierarchies.
• For example, categorical attributes, like street, can be generalized to
higher level concepts, like city or county
• Normalization: Scaling of data to fall within a smaller range.
• Discretization: Raw values of numeric data are replaced by
intervals. For example, Age.
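
Normalization can be illustrated with min-max scaling, which is one common way to scale values into a smaller range; the slides do not name a specific method, so this choice is an assumption. The ages are the first five values from the sample customer dataset, and the function name `min_max_normalize` is hypothetical.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no spread; map every value to new_min.
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]


ages = [19, 21, 20, 23, 31]
normalized_ages = min_max_normalize(ages)
print([round(v, 3) for v in normalized_ages])
```

After scaling, the youngest customer maps to 0 and the oldest to 1, so attributes measured in different units (age in years, income in k$) become comparable.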
Binning
• Binning: discretize a numerical variable
• first sort data and partition into (equal-frequency) bins
• then one can smooth by bin means, smooth by bin
median, smooth by bin boundaries, etc.
• Equal-width partitioning:
• It divides the range into N intervals of equal size: uniform grid
• if A and B are the lowest and highest values of the attribute, the
width of intervals will be: W = (B - A)/N.
• The most straightforward method
• Skewed data is not handled well.
• Equi-Width Binning for Data Smoothing
• Given data set: 5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215
• The formula for binning into equal widths is W = (max − min)/N
• Identifying the Outliers
• Resolving Inconsistencies
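
Equal-width partitioning with W = (max − min)/N can be sketched on the given data set. The slide does not fix the number of bins, so N = 3 is an assumption, as is the function name `equal_width_bins`.

```python
def equal_width_bins(values, n_bins):
    """Partition values into n_bins intervals of equal width W = (max - min) / n_bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("equal-width binning needs at least two distinct values")
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in values:
        # The maximum value would land just past the last bin, so clamp the index.
        index = min(int((v - lo) / width), n_bins - 1)
        bins[index].append(v)
    return width, bins


data = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
width, bins = equal_width_bins(data, n_bins=3)  # N = 3 is an illustrative choice
print("W =", width)
for i, members in enumerate(bins, start=1):
    print(f"Bin {i}: {members}")
```

With these assumptions W = 70, and nine of the twelve values fall into the first bin, which illustrates why equal-width binning handles skewed data poorly.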
Data Transformation: Binning Methods for Data Smoothing
• Equal-depth partitioning: suited to a large number of observations or to skewed data
• It divides the range into N intervals, each containing approximately the same number of samples
• Good data scaling
• Managing categorical attributes can be tricky.
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means (rounded to the nearest integer):
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin median:
- Bin 1: 8.5, 8.5, 8.5, 8.5
- Bin 2: 22.5, 22.5, 22.5, 22.5
- Bin 3: 28.5, 28.5, 28.5, 28.5
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
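
The worked price example can be reproduced with a short sketch. The function names are hypothetical, the partition assumes the number of values is divisible by the number of bins, and ties in boundary smoothing go to the lower boundary (an assumption; the example has no tie). The exact bin means are 9, 22.75 and 29.25, which the slide shows rounded to 9, 23 and 29.

```python
from statistics import mean, median

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]


def equal_depth_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of equal size."""
    ordered = sorted(values)
    size = len(ordered) // n_bins  # assumes len(values) is divisible by n_bins
    return [ordered[i * size:(i + 1) * size] for i in range(n_bins)]


def smooth_by_mean(bins):
    """Replace every value in a bin with the bin mean."""
    return [[mean(b)] * len(b) for b in bins]


def smooth_by_median(bins):
    """Replace every value in a bin with the bin median."""
    return [[median(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        # Ties go to the lower boundary (assumed convention).
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


bins = equal_depth_bins(prices, n_bins=3)
print("bins:      ", bins)
print("means:     ", smooth_by_mean(bins))
print("medians:   ", smooth_by_median(bins))
print("boundaries:", smooth_by_boundaries(bins))
```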
Do the Following…

Data:
11, 13, 13, 15, 15, 16, 19, 20, 20, 20, 21, 21, 22, 23, 24, 30, 40, 45, 45, 45, 71, 72, 73, 75
Do the following based on the Equi-Depth and Equi-Width Binning
methods by choosing 4 bins
a) Smoothing by bin mean
b) Smoothing by bin median
c) Smoothing by bin boundaries



Data reduction
• Applied to obtain a reduced representation of the data set
that is much smaller in volume, yet closely maintains the
integrity of the original data.
• Data cube aggregation: where aggregation operations are applied
to the data in the construction of a data cube (aggregated data).
• Dimension reduction: reducing the number of attributes (feature
selection), decision tree induction
• Data compression: compressed representation of the original data:
Principal Component Analysis, Wavelet Transforms
• Discretization and concept hierarchy generation: where raw data
values for attributes are replaced by ranges or higher conceptual
levels.
• Concept hierarchies allow the mining of data at multiple levels of
abstraction, and are a powerful tool for data mining
Data Mining Tasks

[Figure: a data table (Tid, Refund, Marital Status, Taxable Income, Cheat) at the centre, surrounded by four data mining tasks: Clustering, Predictive Modeling, Anomaly Detection, Association Rules]


From Data to Knowledge

[Figure: a data table annotated with numerical attribute, categorical attribute, missing values, and class labels]

If (Headache = No AND Vomiting = Yes AND Temperature = High)
THEN Viral illness = Yes
Pattern Evaluation and Knowledge
Representation
• Identifying interesting patterns representing the knowledge based on
interestingness measures.
• Data summarization and visualization methods are used to make the
data understandable by the user.
• Data visualization and knowledge representation tools are used to
represent the mined data. Data is visualized in the form of reports,
tables, etc.


Data mining task primitives



Data Mining Task Primitives
• Refer to the basic building blocks or components that are used to
construct a data mining process
• Used to represent the most common and fundamental tasks that are
performed during the data mining process.
• Provide a modular and reusable approach, which can improve the
performance, efficiency, and understandability of the data mining
process.



Data mining primitives
• A data mining query is defined in terms of the following primitives
• Task-relevant data:
• The database portion to be investigated.
• For example, consider a manager of All Electronics in charge of sales in the US and
Canada who would like to study the buying trends of customers in Canada.
• Rather than mining the entire database, only the relevant portion of the data is specified,
together with the relevant attributes
• The kinds of knowledge to be mined:
• specifies the data mining functions to be performed on the relevant data in order to
mine useful information, such as characterization, discrimination, association,
classification, clustering, or evolution analysis.
• For instance, if studying the buying habits of customers in Canada, Manager may
choose to mine associations between customer profiles and the items that these
customers like to buy
Data mining primitives
• Background knowledge:
• Users can specify background knowledge, or knowledge about the domain to be
mined such as industry-specific terminology, trends, or best practices
• Useful for guiding the knowledge discovery process, and for evaluating the patterns found.
• Ex: concept hierarchies and user beliefs about relationships in the data, which help the system evaluate
patterns and perform the mining more efficiently.
• Interestingness measures:
• Refers to the methods and criteria used to evaluate the quality and relevance of
the patterns or insights discovered through data mining
• Functions used to separate uninteresting patterns from knowledge.
• Used to guide the mining process, or after discovery, to evaluate the discovered patterns.
• Patterns are evaluated with interestingness measures such as utility,
certainty, and novelty, and an appropriate threshold value is set for
the pattern evaluation.


Interestingness measures
• An interestingness measure is associated with a threshold,
which may be controlled by the user
• Measures:
• support(X ⇒ Y) = P(X ∪ Y), the probability that a transaction contains both X and Y
• confidence(X ⇒ Y) = P(Y | X)
• “What makes a pattern interesting? Can a data mining system generate all of
the interesting patterns? Can a data mining system generate only interesting
patterns?”
• A pattern is interesting if it is
• easily understood by humans,
• valid on new or test data with some degree of certainty, potentially useful, and novel.
• Can a data mining system generate all of the interesting patterns?
• This concerns the completeness of the data mining algorithms
• It is often unrealistic and inefficient for data mining systems to generate all of the possible
patterns
• Can a data mining system generate only interesting patterns?
• This is an optimization problem in data mining.
• It is highly desirable for data mining systems to generate only interesting patterns.
• It would be more efficient for users and DM systems as it may reduce the search for
interesting patterns
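
Support and confidence can be computed directly from their definitions. The sketch below uses the five transactions from the transaction data example earlier in this module; the rule Diaper ⇒ Beer and the function names are illustrative choices, not taken from the slides.

```python
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def support(antecedent, consequent, transactions):
    """Fraction of all transactions that contain every item of X and of Y."""
    both = antecedent | consequent
    return sum(both <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Among transactions containing X, the fraction that also contain Y."""
    covered = [t for t in transactions if antecedent <= t]
    if not covered:
        return 0.0
    return sum(consequent <= t for t in covered) / len(covered)


rule_support = support({"Diaper"}, {"Beer"}, transactions)
rule_confidence = confidence({"Diaper"}, {"Beer"}, transactions)
print(f"support(Diaper => Beer) = {rule_support:.2f}")
print(f"confidence(Diaper => Beer) = {rule_confidence:.2f}")
```

Diaper and Beer occur together in 2 of the 5 transactions (support 0.4), and 2 of the 3 transactions containing Diaper also contain Beer (confidence about 0.67). A rule is reported only if both values reach the user-set thresholds.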
Interestingness measures
• Measures are intended for selecting and ranking patterns according to their potential
interest to the user
• Good measures also allow the time and space costs of the mining process to be reduced
• Nine specific criteria are used to determine whether or not a pattern is interesting:
1. conciseness,
2. coverage,
3. reliability,
4. peculiarity,
5. diversity,
6. novelty,
7. surprisingness,
8. utility, and
9. actionability
• Objective vs. subjective interestingness measures
• Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
• Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty, actionability, etc.

Figure by Howard J. Hamilton


Interestingness measures
1. Conciseness: A pattern is concise if it contains relatively few attribute-value pairs, while a set of
patterns is concise if it contains relatively few patterns. Concise patterns are easy to understand and remember.
2. Generality/coverage: A pattern is general if it covers a relatively large subset of a dataset
• Measures the comprehensiveness of a pattern, i.e., the fraction of all records in the dataset that match the pattern
• If a pattern characterizes more information in the dataset, it tends to be more interesting
• E.g., Bread ⇒ Milk occurring in all transactions in a supermarket

3. Reliability: A pattern is reliable if the relationship described by the pattern occurs in a high percentage
of applicable cases, e.g., highly accurate classification rules or high-confidence associations
• E.g., Bread ⇒ Milk holding in all transactions to which the rule applies

4. Peculiarity: A pattern is peculiar if it is far away from other discovered patterns according to some
distance measure.
• Peculiar patterns are generated from peculiar data (or outliers), which are relatively few in number and
significantly different from the rest of the data
• Peculiar patterns may be unknown to the user
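
Conciseness, coverage, and reliability from the list above can be computed directly for a rule-style pattern. The following is a minimal sketch, assuming a hypothetical pattern "IF age = young AND student = yes THEN buys = yes" and a handful of invented records; the attribute names and values are illustrative only and do not come from the slides.

```python
# Invented records; attribute names and values are hypothetical.
records = [
    {"age": "young", "student": "yes", "buys": "yes"},
    {"age": "young", "student": "no", "buys": "no"},
    {"age": "middle", "student": "no", "buys": "yes"},
    {"age": "young", "student": "yes", "buys": "yes"},
    {"age": "senior", "student": "yes", "buys": "no"},
]

# Pattern: IF age = young AND student = yes THEN buys = yes
condition = {"age": "young", "student": "yes"}
outcome = ("buys", "yes")

def matches(record, condition):
    """True if the record satisfies every attribute-value pair of the condition."""
    return all(record[attr] == value for attr, value in condition.items())

covered = [r for r in records if matches(r, condition)]

# Conciseness: fewer attribute-value pairs means a more concise pattern.
conciseness = len(condition)
# Coverage: fraction of all records that the pattern applies to.
coverage = len(covered) / len(records)
# Reliability: fraction of covered records for which the predicted outcome holds.
reliability = sum(r[outcome[0]] == outcome[1] for r in covered) / len(covered)

print(conciseness, coverage, reliability)
```

Here reliability plays the same role as confidence does for association rules.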



Interestingness measures
5. Diversity: A pattern is diverse if its elements differ significantly from each other, while a set of
patterns is diverse if the patterns in the set differ significantly from each other.
• Diversity is a common factor for measuring the interestingness of summaries

6. Novelty: A pattern is novel to a person if he or she did not know it before and is not able to infer it
from other known patterns.
• A novel pattern is new and not contradicted by any pattern already known to the user

7. Surprisingness: A pattern is surprising (or unexpected) if it contradicts a person’s existing
knowledge or expectations
• E.g., Bread ⇒ Milk is the usual buying pattern based on experience, but a user buying litchi juice with milk
does not fit any known purchasing pattern

8. Utility: A pattern is of utility if its use by a person contributes to reaching a goal


9. Actionability/applicability: A pattern is actionable (or applicable) in some domain if it enables
decision making about future actions in this domain
Data mining primitives
• Presentation and visualization of discovered patterns:
• Refers to the form in which discovered patterns are to be displayed.
• Users can choose from different forms for knowledge presentation, such as
rules, tables, charts, graphs, decision trees, and cubes.



Architecture of Typical Data Mining System



Architecture of a typical data mining system
• Database, data warehouse, World Wide Web, or other information
repository:
• One or a set of databases, data warehouses, spreadsheets, or other kinds of information
repositories.
• Data cleaning and data integration techniques may be performed on the data.
• Database or data warehouse server:
• Responsible for fetching the relevant data, based on the user’s data mining request.
• Knowledge base:
• Knowledge is used to guide the search or evaluate the interestingness of resulting
patterns.
• knowledge can include concept hierarchies, used to organize attributes or attribute
values into different levels of abstraction.
• Knowledge such as user beliefs, which can be used to assess a pattern’s interestingness
based on its unexpectedness, may also be included.
Architecture of a typical data mining system
• Data mining engine:
• Consists of a set of functional modules for tasks such as characterization, association and
correlation analysis, classification, prediction, cluster analysis, outlier analysis, and evolution
analysis.
• Pattern evaluation module:
• To focus the search toward interesting patterns.
• To filter out discovered patterns.
• The pattern evaluation module may be integrated with the mining module, depending on the
implementation of the data mining method used.
• For efficient data mining, it is highly recommended to push the evaluation of pattern
interestingness as deep as possible into the mining process so as to confine the search to
only the interesting patterns.



Architecture of a typical data mining system
• User interface:
• Communicates between users and the data mining system
• Allow the user to interact with the system by specifying a data mining query
or task
• Provide information to help focus the search
• Performing exploratory data mining based on the intermediate data mining
results.
• Allow the user to browse database and data warehouse schemas or data
structures, evaluate mined patterns, and visualize the patterns in different
forms.



Pop Quiz-2
1. Binning is used for ________
2. Which data mining primitive provides the subset of the features
required to complete the process?
3. Which interestingness measure assures highly accurate classification of
patterns?
• Say True or False:
4. The data reduction task of data mining reduces the integrity of the data by
reducing the attributes of the data set
5. Changes in the independent variables determine the value of the dependent
variables.
Answers:
1. Data smoothing, removing noise
2. Task-relevant data
3. Reliability
4. False
5. True



Evolution of Database Technology
• 1960s:
• Data collection, database creation, IMS and network DBMS

• 1970s:
• Relational data model, relational DBMS implementation
• 1980s:
• RDBMS, advanced data models (extended-relational, OO,
deductive, etc.)
• Application-oriented DBMS (spatial, scientific, engineering,
etc.)
• 1990s:
• Data mining, data warehousing, multimedia databases, and
Web databases
• 2000s
• Stream data management and mining
• Data mining and its applications
• Web technology (XML, data integration) and global
information systems
Data Mining: Classification Schemes
• General functionality
• Descriptive data mining
• Predictive data mining

• Different views, different classifications


• Kinds of databases to be mined
• Kinds of knowledge to be discovered
• Kinds of techniques utilized
• Kinds of applications adapted



Data Mining Classification Schemes
• Prediction Methods
• using some variables to predict unknown or future values of other variables
• Descriptive Methods
• finding human-interpretable patterns describing the data



Descriptive and Predictive Data Mining
• Descriptive analysis is used to mine data and provide the latest information on past or recent events.
• What happened?
• Where exactly is the problem?
• What is the frequency of the problem?
• Predictive analysis answers queries about the future, using historical data as the chief basis for decisions.
• What will happen next?
• What is the outcome if these trends continue?
• What actions are required to be taken?

BASIS FOR COMPARISON | DESCRIPTIVE MINING | PREDICTIVE MINING
Basic | Identifies what happened in the past by analyzing stored data | Describes what can happen in the future with the help of past data analysis
Requires | Data aggregation and data mining | Statistics and forecasting methods
Preciseness | Provides accurate data | Produces results that are not guaranteed to be accurate
Type of approach | Reactive | Proactive
Practical analysis methods | Standard reporting, query/drill down, and ad-hoc reporting | Predictive modelling, forecasting, simulation, and alerts
Multi-Dimensional View of Data Mining
• Data to be mined
• Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-
series, text, multi-media, heterogeneous, legacy, WWW
• Knowledge to be mined
• Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis,
etc.
• Multiple/integrated functions and mining at multiple levels
• Techniques utilized
• Machine learning, statistics, visualization, etc.
• Applications adapted
• Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining,
Web mining, etc.



Data Mining Functionalities
• Data mining functionalities specify the kind of patterns to be found in
data mining tasks.
• data mining tasks can be classified into two categories: descriptive
and predictive.
• Descriptive mining tasks characterize the general properties of the data in
the database.
• Predictive mining tasks perform inference on the current data in order to
make predictions
Data Mining Functionalities
• But sometimes users may not know what kinds of patterns in their data
may be interesting
• Users like to search for several different kinds of patterns in parallel
• Data mining systems should be able to
• Discover patterns at various granularities (different levels of abstraction).
• Allow users to specify hints to focus the search for interesting patterns.
• Because some patterns may not hold for all of the data in the database, a
measure of certainty or “trustworthiness” is usually associated with each
discovered pattern.
Data Mining Functionalities
DM Functionalities1: Characterization or discrimination
DM Functionalities1: Characterization or
discrimination

Online Analytical Processing (OLAP): a multidimensional analysis model of
data that helps the user select and view data from
different points of view
Data Summaries: OLAP Operations
marks secured by a batch of students in a class test
OLAP Processing
a two-way table, in which there are two
characteristics, namely, marks secured by the
students in the test and the gender of the students

a three – way table with three factors, namely,


marks, gender and location.
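
The one-way, two-way, and three-way tables described above can be sketched as simple counts over records. This is a minimal illustration with invented test results; the mark bands, genders, and locations are assumptions, and a real OLAP engine would work on a precomputed data cube rather than raw tuples.

```python
from collections import Counter

# Invented test results: (marks band, gender, location)
results = [
    ("60-79", "F", "urban"),
    ("80-100", "M", "urban"),
    ("60-79", "M", "rural"),
    ("80-100", "F", "rural"),
    ("60-79", "F", "urban"),
]

# One-way table: number of students per marks band.
one_way = Counter(band for band, _, _ in results)
# Two-way table: marks band by gender.
two_way = Counter((band, gender) for band, gender, _ in results)
# Three-way table: marks band by gender by location.
three_way = Counter(results)

# Roll-up: aggregate the three-way table over location to recover the two-way table.
rolled_up = Counter()
for (band, gender, _location), count in three_way.items():
    rolled_up[(band, gender)] += count

print(one_way["60-79"], two_way[("60-79", "F")], rolled_up == two_way)
```

Going from the two-way table to the three-way table corresponds to drilling down on location; the loop at the end is the reverse roll-up.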
DM Functionalities1: Characterization or discrimination
• The output of data characterization can be presented in various
forms.
• Examples include pie charts, bar charts, curves, multidimensional data
cubes, and multidimensional tables.

• The resulting descriptions can also be presented as generalized


relations or in rule form.
DM Functionalities1: Characterization or
discrimination
• Example: A data mining system should be able to produce a
description summarizing the characteristics of customers who spend
more than $1,000 a year at AllElectronics.
• The result could be a general profile of the customers, such as they
are 40–50 years old, employed, and have excellent credit ratings.
• The system should allow users to drill down on any dimension, such
as on occupation in order to view these customers according to their
type of employment.
DM Functionalities1: Characterization or
discrimination
• Data discrimination is a comparison of the general features of target
class data objects with the general features of objects from one or a
set of contrasting classes.
• The target and contrasting classes can be specified by the user, and
the corresponding data objects retrieved through database queries.
• For example, the user may like to compare the general features of
software products whose sales increased by 10% in the last year with
those whose sales decreased by at least 30% during the same period.
DM Functionalities1: Characterization or
discrimination
• Example of Data discrimination:
• A data mining system should be able to compare two groups of AllElectronics
customers, such as those who shop for computer products regularly versus
those who rarely shop for such products
• 80% of the customers who frequently purchase computer products are
between 20 and 40 years old and have a university education, whereas 60% of the
customers who infrequently buy such products are either seniors or youths
and have no university degree.
• Drilling down on a dimension, such as occupation, or adding new
dimensions, such as income level, may help in finding even more
discriminative features between the two classes.
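
Data discrimination can be sketched as comparing the feature profile of a target class with that of a contrasting class. The records below are invented for illustration (they do not reproduce the 80%/60% figures above), and the `profile` helper is a hypothetical name.

```python
from collections import Counter

# Invented customer records: (class, age group, has university education)
customers = [
    ("frequent", "20-40", True),
    ("frequent", "20-40", True),
    ("frequent", "41-60", False),
    ("infrequent", "over 60", False),
    ("infrequent", "under 20", False),
    ("infrequent", "20-40", True),
]

def profile(target_class):
    """Share of each (age group, education) combination within one class."""
    rows = [(age, edu) for cls, age, edu in customers if cls == target_class]
    counts = Counter(rows)
    return {key: count / len(rows) for key, count in counts.items()}

frequent_profile = profile("frequent")      # target class
infrequent_profile = profile("infrequent")  # contrasting class

# A feature combination is discriminative when its share differs strongly between classes.
key = ("20-40", True)
print(frequent_profile.get(key, 0.0), infrequent_profile.get(key, 0.0))
```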
Data Mining Functionalities: (2) Association and
Correlation Analysis
• Frequent patterns (or frequent itemsets)
• What items are frequently purchased together in your Walmart?
• Association, correlation vs. causality
• A typical association rule
• Diaper ⇒ Beer [0.5%, 75%] (support, confidence)
• Are strongly associated items also strongly correlated?
• How to mine such patterns and rules efficiently in large datasets?
• How to use such patterns for classification, clustering, and other applications? By
using measures
• (Frequent) sequential pattern: a frequently occurring subsequence, e.g., first buy a digital
camera, then buy memory cards
• (Frequent) structured pattern: a substructure can refer to
different structural forms, such as graphs, trees, or lattices, which
may be combined with itemsets or subsequences. If a substructure occurs
frequently, it is called a (frequent) structured pattern
Association analysis

• Suppose that a marketing manager at AllElectronics wants to know which items
are frequently purchased together (i.e., within the same transaction).
• An example of such a rule, mined from the AllElectronics transactional database,
is
• buys(X, “computer”) ⇒ buys(X, “software”) [support = 1%,confidence = 50%],
• where X is a variable representing a customer.
• A confidence, or certainty, of 50% means that if a customer buys a computer, there is a 50%
chance that she will buy software as well.
• A 1% support means that 1% of all the transactions under analysis show that computer and
software are purchased together.
• This association rule involves a single attribute or predicate (i.e., buys) that repeats.
• Association rules that contain a single predicate are referred to as single-dimensional association
rules.
• Dropping the predicate notation, the rule can be written simply as
• “computer ⇒ software [1%, 50%].”
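
Support and confidence for a rule such as computer ⇒ software can be counted directly from transactions. The sketch below uses five invented transactions (so the numbers differ from the 1% and 50% quoted above); the function names `support` and `confidence` are chosen for illustration.

```python
# Toy transactions; the items and counts are made up for illustration.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer", "paper"},
    {"computer", "software", "printer"},
    {"paper"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate P(consequent | antecedent) from the transactions."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

rule_support = support({"computer", "software"}, transactions)
rule_confidence = confidence({"computer"}, {"software"}, transactions)
print(f"computer => software [support = {rule_support:.0%}, confidence = {rule_confidence:.0%}]")
```

Two of the five transactions contain both items, and two of the three computer transactions also contain software, which is where the two ratios come from.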
Association analysis
• Suppose, instead, that we are given the AllElectronics relational
database related to purchases.
• A data mining system may find association rules like
• age(X, “20..29”) ∧ income(X, “40K..49K”) ⇒ buys(X, “laptop”) [support = 2%,
confidence = 60%].
• The rule indicates that of the AllElectronics customers under study, 2% are 20 to 29 years
old with an income of $40,000 to $49,000 and have purchased a laptop (computer) at
AllElectronics.
• There is a 60% probability that a customer in this age and income group will purchase a
laptop.
• Note that this is an association involving more than one attribute or predicate (i.e., age,
income, and buys).
• Adopting the terminology used in multidimensional databases, where each attribute is
referred to as a dimension, the above rule can be referred to as a multidimensional
association rule.
Discard Association Rules
• Association rules are discarded as uninteresting if they do not satisfy
both a minimum support threshold and a minimum confidence
threshold.
• Additional analysis can be performed to uncover interesting statistical
correlations between associated attribute–value pairs.
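
Discarding rules by threshold can be sketched as a simple filter. The candidate rules, their statistics, and both thresholds below are invented. Lift is used here as one common correlation measure; the slides do not name a specific one, so treat that choice as an assumption.

```python
# Hypothetical candidate rules with precomputed statistics (all numbers invented).
# Each entry: (antecedent, consequent, rule support, confidence, support of consequent)
candidates = [
    ("computer", "software", 0.010, 0.50, 0.20),
    ("printer", "paper", 0.002, 0.80, 0.30),
    ("milk", "bread", 0.050, 0.40, 0.60),
]

MIN_SUPPORT = 0.005     # assumed user-chosen threshold
MIN_CONFIDENCE = 0.45   # assumed user-chosen threshold

def is_strong(rule):
    """A rule is kept only if it meets BOTH minimum thresholds."""
    _, _, sup, conf, _ = rule
    return sup >= MIN_SUPPORT and conf >= MIN_CONFIDENCE

def lift(rule):
    """Lift = confidence / P(consequent); values above 1 suggest positive correlation."""
    _, _, _, conf, consequent_support = rule
    return conf / consequent_support

strong_rules = [r for r in candidates if is_strong(r)]
for rule in strong_rules:
    print(rule[0], "=>", rule[1], "lift =", round(lift(rule), 2))
```

The printer rule fails on support and the milk rule fails on confidence, so only the computer rule survives.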
Supervised Learning
• Defined by its use of labeled datasets
• These datasets are designed to train or “supervise” algorithms into
classifying data or predicting outcomes accurately
• Separated into two types of problems when data mining:
• Classification: assign test data to a specific group
• Regression: understand the relationship between dependent and independent
variables (predicting numerical values)
Unsupervised Learning
• Used to analyze and cluster unlabeled data sets
• Algorithms discover hidden patterns in data without the need for human
intervention
• Used for three main tasks:
• Clustering: grouping unlabeled data based on their similarities or differences
• Association: uses different rules to find relationships between variables in a
given dataset
• Dimensionality reduction: used when the number of features (or dimensions)
in a given dataset is too high; reduces dimensions while preserving data integrity
Data Mining Functionalities: (3)
Classification and Prediction
• Classification and label prediction
• Construct models (functions) based on some training examples
• Describe and distinguish classes or concepts for future prediction
• E.g., classify countries based on (climate), or classify cars based on (gas mileage)
• Predict some unknown class labels
• Typical methods
• Decision trees, naïve Bayesian classification, support vector machines, neural
networks, rule-based classification, pattern-based classification,
• Regression analysis is used for prediction
• Typical applications:
• Credit card fraud detection, direct marketing, classifying stars, diseases, web-pages
Classification and Regression for
Predictive Analysis
• Classification is the process of finding a model (or
function) that describes and distinguishes data classes or
concepts.
• The model is derived based on the analysis of a set of
training data (i.e., data objects for which the class labels
are known).
• The model is used to predict the class label of objects for
which the class label is unknown.
• “How is the derived model presented?” The derived
model may be represented in various forms, such as
classification rules (i.e., IF-THEN rules), decision trees,
mathematical formulae, or neural networks
Classification and Regression for
Predictive Analysis
• A decision tree is a flowchart-like tree structure, where
each node denotes a test on an attribute value, each
branch represents an outcome of the test, and tree
leaves represent classes or class distributions.
• Decision trees can easily be converted to classification rules.
• A neural network, when used for classification, is
typically a collection of neuron-like processing units
with weighted connections between the units.
• There are many other methods for constructing
classification models, such as naïve Bayesian classification,
support vector machines, and k-nearest-neighbor
classification.
• Regression example: predict the amount of revenue that each item
will generate during an upcoming sale at
AllElectronics, based on previous sales data.
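
The claim that decision trees convert easily to classification rules can be illustrated with a hand-written toy tree. This is a sketch only: the tree structure, attribute names, and class labels are hypothetical, and no tree-learning algorithm is involved.

```python
# A hand-written toy tree; attribute names and class labels are hypothetical.
# Internal node: (attribute, {outcome: subtree}); leaf: class label string.
tree = ("price", {
    "low": "good response",
    "high": ("brand", {
        "known": "mild response",
        "unknown": "no response",
    }),
})

def tree_to_rules(node, conditions=()):
    """Walk every root-to-leaf path and emit one IF-THEN rule per leaf."""
    if isinstance(node, str):
        premise = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
        return [f"IF {premise} THEN class = {node}"]
    attribute, branches = node
    rules = []
    for outcome, subtree in branches.items():
        rules.extend(tree_to_rules(subtree, conditions + ((attribute, outcome),)))
    return rules

def classify(node, record):
    """Follow the test outcomes for one record until a leaf (class label) is reached."""
    while not isinstance(node, str):
        attribute, branches = node
        node = branches[record[attribute]]
    return node

rules = tree_to_rules(tree)
for rule in rules:
    print(rule)
print(classify(tree, {"price": "high", "brand": "known"}))
```

Each leaf yields exactly one rule, so the rule set and the tree classify every record identically.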
Classification and Regression for
Predictive Analysis
• Classification predicts categorical (discrete, unordered) labels
• Classification and regression may need to be preceded by relevance analysis, which
attempts to identify attributes that are significantly relevant to the classification and
regression process.
• Such attributes will be selected for the classification and regression process.
• Other attributes, which are irrelevant, can then be excluded from consideration.
• E.g., to predict whether it is going to be hot or cold tomorrow, use a classification algorithm
• Regression models continuous-valued functions.
• Regression is used to predict missing or unavailable numerical data values
rather than (discrete) class labels.
• Regression analysis is a statistical methodology that is most often used
for numeric prediction, although other methods exist as well.
• Regression also encompasses the identification of distribution trends
based on the available data.
• E.g., to predict tomorrow's temperature using a weather dataset, use a regression algorithm
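
The hot/cold versus temperature contrast can be sketched in a few lines: regression returns a number, classification returns a label. The weather observations are invented, and thresholding the regression output stands in for a real classifier purely to contrast the two output types.

```python
# Invented observations: (humidity %, next-day temperature in deg C)
data = [(30, 34.0), (45, 31.0), (60, 28.0), (75, 25.0), (90, 22.0)]

def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(data)

def predict_temperature(humidity):
    """Regression: the output is a continuous numeric value."""
    return slope * humidity + intercept

def predict_label(humidity, threshold=27.0):
    """Classification: the output is a categorical label (threshold is an assumption)."""
    return "hot" if predict_temperature(humidity) >= threshold else "cold"

print(predict_temperature(50), predict_label(50), predict_label(80))
```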
Classification and Regression for
Predictive Analysis
• Classification predicts categorical (discrete, unordered) labels.
• Prediction models continuous-valued functions.
• Regression analysis (statistical methodology)
• Regression is used to predict missing or unavailable numerical
data values.
Classification and Regression for
Predictive Analysis
• Classification
• Suppose a sales manager of AllElectronics wants to classify a large set of items in the
store, based on three kinds of responses to a sales campaign: good response, mild
response, and no response.
• Each of these three classes is described by features of the items, such as price, brand,
place made, type, and category.
• The resulting classification should maximally distinguish each class from the others,
presenting an organized picture of the data set, e.g., as a decision tree.
• The decision tree may identify price as being the single factor that best distinguishes
the three classes.
• The tree may reveal that, in addition to price, other features that help to further
distinguish objects of each class from one another include brand and place made.
• Regression
• Suppose instead that, rather than predicting categorical response labels for each store item,
you would like to predict the amount of revenue that each item will generate during an upcoming
sale at AllElectronics, based on the previous sales data.
• This is regression analysis, because the regression model constructed will predict a
continuous function (or ordered value).
Data Mining Functionalities: (4) Cluster
Analysis
• Unsupervised learning (i.e., Class label is unknown)
• Group data to form new categories (i.e., clusters), e.g., cluster houses
to find distribution patterns
• Principle: Maximizing intra-class similarity & minimizing interclass
similarity
• Many methods and applications

• Clustering is a process of partitioning a set of data
(or objects) into a set of meaningful sub-classes, called clusters
• Helps users understand the natural grouping
or structure in a data set
Cluster Analysis

• Unlike classification and regression, which analyze class-labeled (training) data
sets, clustering analyzes data objects without consulting class labels.
• In many cases, class labeled data may simply not exist at the beginning.
• Used to generate class labels for a group of data.
• The objects are clustered or grouped based on the principle of maximizing the
intra class similarity and minimizing the interclass similarity.
• That is, clusters of objects are formed so that objects within a cluster have high similarity in
comparison to one another, but are rather dissimilar to objects in other clusters.
• Each cluster formed can be viewed as a class of objects, from which rules can be
derived.
• Clustering can also facilitate taxonomy formation, that is, the organization of
observations into a hierarchy of classes that group similar events together.
Cluster Analysis
• Cluster analysis can be performed on
AllElectronics customer data to identify
homogeneous subpopulations of
customers.
• These clusters may represent individual
target groups for marketing.
• Figure shows a 2-D plot of customers
with respect to customer locations in a
city. Three clusters of data points are
evident.
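
Grouping customer locations into three clusters can be sketched with a plain k-means loop. The slides do not prescribe a clustering algorithm, so k-means is an assumption here, as are the invented coordinates and the hand-picked initial centroids.

```python
# Invented 2-D customer locations forming three loose groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9), (1, 8), (2, 9), (1, 9)]

def squared_distance(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def k_means(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = mean of the cluster; an empty cluster keeps its old centroid.
        centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return clusters, centroids

# Initial centroids are picked by hand here; real implementations choose them more carefully.
clusters, centroids = k_means(points, centroids=[(1, 1), (8, 8), (1, 8)])
for cluster in clusters:
    print(cluster)
```

Points within each resulting cluster are close to one another and far from the other clusters, which is the intra-class versus interclass similarity principle stated earlier.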
Data Mining Function: (5) Outlier
Analysis
• A data set may contain objects that do not comply with the general behavior or model
of the data.
• These data objects are outliers.
• Outlier analysis
• Outlier: A data object that does not comply with the general behavior of the data
• Noise or exception? One person’s garbage could be another person’s treasure.
• Noise should be removed before outlier detection
• Noise does not carry meaning, but an outlier does
• Methods: by-product of clustering or regression analysis
• Useful in fraud detection, rare events analysis
• Outlier analysis may uncover fraudulent usage of credit cards by detecting purchases
of unusually large amounts for a given account number in comparison to regular
charges incurred by the same account.
• Outlier values may also be detected with respect to the locations and types of
purchase, or the purchase frequency
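
Flagging an unusually large charge on one account can be sketched with a z-score test. The charges and the threshold of 2 standard deviations are invented, and the z-score is only one simple statistical approach (the slides mention clustering and regression by-products instead).

```python
from statistics import mean, pstdev

# Invented charges for a single account; one amount is far larger than the rest.
charges = [20.0, 35.0, 25.0, 30.0, 40.0, 22.0, 900.0]

def z_score_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing can be an outlier
    return [v for v in values if abs(v - mu) / sigma > threshold]

outliers = z_score_outliers(charges)
print(outliers)
```

With very small samples the outlier itself inflates the mean and standard deviation, so robust alternatives such as median-based tests are often preferred in practice.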
Data Mining Function: (6) Time and Ordering:
Sequential Pattern, Trend and Evolution Analysis
• Data evolution analysis describes and models regularities or trends for
objects whose behavior changes over time.
• It may include
• Characterization and discrimination
• Association and correlation analysis
• Classification and prediction
• Clustering of time related data

• Analysis include
• Time-series data analysis
• Sequence or periodicity pattern matching
• Similarity-based data analysis.

Data Mining Function: (6) Time and Ordering:
Sequential Pattern, Trend and Evolution Analysis
• Sequence, trend and evolution analysis
• Trend, time-series, and deviation analysis: e.g., regression and value
prediction
• Sequential pattern mining
• e.g., first buy digital camera, then buy large SD memory cards
• Periodicity analysis
• Motifs and biological sequence analysis
• Approximate and consecutive motifs (a decorative image or design, especially
a repeated one forming a pattern.)
• Similarity-based analysis
• Mining data streams
• Ordered, time-varying, potentially infinite, data streams
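
The support of a sequential pattern such as "first digital camera, then memory card" can be counted as the fraction of purchase histories that contain the items in that order. The histories below are invented and the helper names are hypothetical; real sequential pattern miners also generate candidate patterns, which this sketch does not.

```python
# Invented purchase histories, each ordered by time.
sequences = [
    ["digital camera", "memory card", "tripod"],
    ["laptop", "digital camera", "memory card"],
    ["memory card", "digital camera"],
    ["digital camera", "tripod", "memory card"],
]

def contains_in_order(sequence, pattern):
    """True if the items of `pattern` appear in `sequence` in the same order (gaps allowed)."""
    remaining = iter(sequence)
    # `item in remaining` advances the iterator, so each match must come after the previous one.
    return all(item in remaining for item in pattern)

def sequence_support(pattern, sequences):
    """Fraction of sequences that contain the pattern as an ordered subsequence."""
    return sum(contains_in_order(s, pattern) for s in sequences) / len(sequences)

camera_then_card = sequence_support(["digital camera", "memory card"], sequences)
card_then_camera = sequence_support(["memory card", "digital camera"], sequences)
print(camera_then_card, card_then_camera)
```

Unlike itemset support, the order matters: the same two items give different support values depending on which comes first.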
Pop-Quiz
• OLAP is_____
• OLTP is_____
• OLAP is used for _____
• Prediction is done using____
• Clustering using ____ data
• Classification and Regression uses _____data
• Which is supervised learning?
• Supervised learning applications are:
• Unsupervised learning applications:
• Identify whether the product A($2, “SUPERSONIC”, Disney, Toy, 5yrs+) will attract more
customers in the upcoming sale?
• Determine the revenue that will be developed in an upcoming sale by each product in Amazon?



Integration of a data mining system with a
database system
• Integration of DM system with DB or DW systems will improve
efficiency
• Four integration Schemes:
• No coupling
• Loose Coupling
• Semitight Coupling
• Tight Coupling



DM integration with DB: Schemes
• No coupling: No integration with DB or DW
• will not use any function of a database or data warehouse system
• Gets data directly from other storage media such as file
systems, processes it using DM algorithms, and stores the results in another file
• This is time consuming and loses the flexibility/adaptability offered by DB
or DW systems
• Loose Coupling: Lesser Integration
• Uses some functionalities of the DB or DW system to retrieve data
• Retrieves data using those functions, processes it using DM algorithms, and stores
the results in a designated area of the DB or DW, or in a file
• Suitable for smaller datasets
• Better than No coupling scheme
DM integration with DB: Schemes
• Semitight Coupling: More than half is integrated with DB or DW
• Efficient implementations of a few essential data mining primitives are provided in the
DB or DW system
• These include sorting, indexing, aggregation, histogram analysis, multi-way join, and
precomputation of some important statistical measures, such as sum, count, max, min, and
standard deviation
• Tight Coupling: Complete Integration or smoothly integrated
• Fully integrated with DB or DW and Most efficient of all
• DB or DW will be part of DM system or data mining subsystem is one functional
element of information system
• Data mining queries and functions are developed and established on mining query
analysis, data structures, indexing schemes, and query processing methods of
DB/DW systems.
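
The difference between loose and semitight coupling can be sketched with Python's built-in `sqlite3` module. The `sales` table, its rows, and the `customer_totals` result table are invented; SQLite stands in for whatever DB or DW system is actually used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("ann", 120.0), ("bob", 40.0), ("ann", 300.0), ("cy", 75.0), ("bob", 60.0)],
)

# Loose coupling: fetch raw rows through the database, then aggregate in the mining code.
totals_loose = {}
for customer, amount in conn.execute("SELECT customer, amount FROM sales"):
    totals_loose[customer] = totals_loose.get(customer, 0.0) + amount

# Semitight coupling: let the database compute the aggregate primitives (SUM, COUNT).
totals_semitight = {
    customer: total
    for customer, total, _count in conn.execute(
        "SELECT customer, SUM(amount), COUNT(*) FROM sales GROUP BY customer"
    )
}

# Store the result back in a designated table of the database.
conn.execute("CREATE TABLE customer_totals (customer TEXT, total REAL)")
conn.executemany("INSERT INTO customer_totals VALUES (?, ?)", totals_semitight.items())
conn.commit()

stored = dict(conn.execute("SELECT customer, total FROM customer_totals"))
print(totals_loose == totals_semitight, stored)
```

Both variants give the same totals; the semitight version moves less data out of the database because the aggregation runs where the data lives.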
Major Issues in Data Mining (1)

• Mining Methodology
• Mining various and new kinds of knowledge (mining different kinds of knowledge in databases)
• Mining knowledge in multi-dimensional space
• Data mining: An interdisciplinary effort
• Boosting the power of discovery in a networked environment
• Handling noise, uncertainty, and incompleteness of data
• Pattern evaluation and pattern- or constraint-guided mining
• User Interaction
• Interactive mining (interactive mining of knowledge at multiple levels of abstraction)
• Incorporation of background knowledge
• Presentation and visualization of data mining results
Major Issues in Data Mining (2)

• Efficiency and Scalability
• Efficiency and scalability of data mining algorithms
• Parallel, distributed, stream, and incremental mining methods
• Diversity of data types
• Handling complex types of data
• Mining dynamic, networked, and global data repositories
• Mining information from heterogeneous databases and global information systems (WWW)
• Data mining and society
• Social impacts of data mining
• Privacy-preserving data mining
• Invisible data mining
• Application of discovered knowledge
• Domain-specific data mining tools

Steps in Data Mining-Project Perspective
1. Develop an understanding of the purpose of the data mining project.
How will the stakeholder use the results? Who will be affected by the
results? Will the analysis be a one-shot effort or an ongoing procedure?
2. Obtain the dataset to be used in the analysis.
• This often involves sampling from a large database to capture records to be used
in an analysis. It may also involve pulling together data from different databases
or sources.
• The databases could be internal (e.g., past purchases made by customers) or
external (credit ratings). While data mining deals with very large databases,
usually the analysis to be done requires only thousands or tens of thousands of
records.



Steps in Data Mining-Project Perspective
3. Explore, clean, and preprocess the data.
• Involves verifying that the data are in reasonable condition.
• How should missing data be handled?
• Are the values in a reasonable range, given what you would expect for each
variable?
4. Reduce the data dimension, if necessary.
• Dimension reduction can involve operations such as eliminating unneeded
variables, transforming variables (e.g., turning “money spent” into “spent >
$100” vs. “spent ≤ $100”), and creating new variables (e.g., a variable that
records whether at least one of several products was purchased).
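
The two transformations named in step 4 can be sketched as follows. The field names (`money_spent`, `bought_a`, and so on) and the records are hypothetical; only the $100 cut-off comes from the slide.

```python
# Invented customer records; field names are hypothetical.
customers = [
    {"money_spent": 250.0, "bought_a": 0, "bought_b": 1, "bought_c": 0},
    {"money_spent": 80.0, "bought_a": 0, "bought_b": 0, "bought_c": 0},
    {"money_spent": 100.0, "bought_a": 1, "bought_b": 1, "bought_c": 0},
]

PRODUCT_FLAGS = ("bought_a", "bought_b", "bought_c")

def reduce_record(record):
    """Replace four original variables with two derived ones."""
    return {
        # Transformed variable: "spent > $100" vs. "spent <= $100".
        "spent_over_100": record["money_spent"] > 100,
        # New variable: was at least one of the products purchased?
        "bought_any": any(record[flag] for flag in PRODUCT_FLAGS),
    }

reduced = [reduce_record(c) for c in customers]
print(reduced)
```

A customer who spent exactly $100 falls in the "spent ≤ $100" group, as the third record shows.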



Steps in Data Mining-Project Perspective
5. Determine the data mining task.
• (classification, prediction, clustering, etc.).
6. Partition the data (for supervised tasks).
• If the task is supervised (classification or prediction), randomly partition the
dataset into three parts: training, validation, and test datasets.
7. Choose the data mining techniques to be used.
• (regression, neural nets, hierarchical clustering, etc.).
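
The random three-way partition in step 6 can be sketched with the standard library. The 60/20/20 split and the fixed seed are assumptions made for illustration; the slides do not specify proportions.

```python
import random

def partition(records, train=0.6, validation=0.2, seed=42):
    """Shuffle the records and split them into training, validation, and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],  # everything left over becomes the test set
    )

training_set, validation_set, test_set = partition(range(100))
print(len(training_set), len(validation_set), len(test_set))
```

The three parts are disjoint and together cover every record, so the test set stays untouched until the final evaluation in step 9.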



Steps in Data Mining-Project Perspective
8. Use algorithms to perform the task.
9. Interpret the results of the algorithms.
• Involves making a choice as to the best algorithm to deploy, and where
possible, testing the final choice on the test data to get an idea as to how well
it will perform.
10. Deploy the model.
• This step involves integrating the model into operational systems and running it
on real records to produce decisions or actions.

