III-IT-Data Mining Unit 1-Session 3

Data Mining

Unit 1-Session 3
CO1: Identify the types of data to be pre-processed for the given dataset using the preprocessing technique.
LO1.1: Describe data mining and its functionalities.
SO1.1.6: List the different types of attributes with examples.

Data Mining
Unit I – INTRODUCTION
• Introduction – Different Kinds of Data
• Patterns Mined – Applications
• Attribute Types
• Data Preprocessing: Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation
• Data Discretization
• Data Visualization

Types of Data Sets
• Record
  • Relational records
  • Data matrix, e.g., numerical matrix, crosstabs
  • Document data: text documents represented as term-frequency vectors
  • Transaction data
• Graph and network
  • World Wide Web
  • Social or information networks
  • Molecular structures
• Ordered
  • Video data: sequence of images
  • Temporal data: time-series
  • Sequential data: transaction sequences
  • Genetic sequence data
• Spatial, image and multimedia
  • Spatial data: maps
  • Image data
  • Video data

Example of document data (term-frequency vectors):

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0

Example of transaction data:

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
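As a rough illustration of document data, the sketch below turns a short text into a term-frequency vector over a fixed vocabulary. The vocabulary reuses the ten terms listed on this slide; the sample sentence and the function name `term_frequency_vector` are made up for this example.

```python
from collections import Counter

# Vocabulary: the ten terms used in the document-data example.
VOCABULARY = ["team", "coach", "play", "ball", "score",
              "game", "win", "lost", "timeout", "season"]


def term_frequency_vector(text, vocabulary=VOCABULARY):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())
    # Counter returns 0 for terms that never appear.
    return [counts[term] for term in vocabulary]


# Hypothetical sample sentence, not taken from the slides.
sample = "team play ball team score game game win"
vector = term_frequency_vector(sample)
print(vector)
```

Each position in the vector corresponds to one vocabulary term, so documents of different lengths become records with the same set of attributes.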

Data Objects
• Data sets are made up of data objects.
• A data object represents an entity.
• Examples:
• sales database: customers, store items, sales
• medical database: patients, treatments
• university database: students, professors, courses
• Also called samples, examples, instances, data points, objects, or tuples.
• Data objects are described by attributes.
• Database rows -> data objects; columns -> attributes.
Attributes
• Attribute (also called dimension, feature, or variable): a data field representing a characteristic or feature of a data object.
• E.g., customer_ID, name, address
• Types:
• Nominal
• Ordinal
• Binary
• Numeric: quantitative
• Interval-scaled
• Ratio-scaled

Attribute Types
• Nominal: categories, states, or “names of things”
• Hair_color = {auburn, black, blond, brown, grey, red, white}
• marital status, occupation, ID numbers, zip codes
• Binary
• Nominal attribute with only 2 states (0 and 1)
• Symmetric binary: both outcomes equally important
• e.g., gender
• Asymmetric binary: outcomes not equally important.
• e.g., medical test (positive vs. negative)
• Convention: assign 1 to most important outcome (e.g., HIV
positive)
• Ordinal
• Values have a meaningful order (ranking) but magnitude
between successive values is not known.
• Size = {small, medium, large}, grades, army rankings
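One way to read these definitions is by the operations each type supports. The sketch below is only an illustration under that reading: the helper names (`nominal_equal`, `ordinal_less_than`, `encode_test_result`) are hypothetical, and the size ordering is taken from the example on this slide.

```python
# Ordinal: the order of states is meaningful, but not the gap between them.
SIZE_ORDER = ["small", "medium", "large"]


def nominal_equal(a, b):
    """Nominal values can only be tested for equality."""
    return a == b


def ordinal_less_than(a, b, order=SIZE_ORDER):
    """Ordinal values can also be ranked: compare positions in the order."""
    return order.index(a) < order.index(b)


def encode_test_result(result):
    """Asymmetric binary: code the more important outcome (positive) as 1."""
    return 1 if result == "positive" else 0


print(nominal_equal("black", "brown"))
print(ordinal_less_than("small", "large"))
print(encode_test_result("positive"))
```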
Numeric Attribute Types
• Quantity (integer or real-valued)
• Interval
• Measured on a scale of equal-sized units
• Values have order
• E.g., temperature in °C or °F, calendar dates
• No true zero-point
• Ratio
• Inherent zero-point
• We can speak of one value as being a multiple of another (10 K is twice as high as 5 K).
• E.g., temperature in Kelvin, length, counts, monetary quantities
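The practical difference between the two scales can be checked with a small calculation. The sketch below is an illustration only; the two temperatures are arbitrary and `celsius_to_kelvin` is a helper written for this example.

```python
def celsius_to_kelvin(celsius):
    """Shift from the Celsius scale (no true zero) to Kelvin (true zero)."""
    return celsius + 273.15


a_c, b_c = 10.0, 5.0
a_k, b_k = celsius_to_kelvin(a_c), celsius_to_kelvin(b_c)

# Differences are meaningful on an interval scale: both scales agree.
diff_c = a_c - b_c
diff_k = a_k - b_k

# Ratios are only meaningful on a ratio scale.
ratio_c = a_c / b_c  # 2.0, yet 10 °C is not "twice as hot" as 5 °C
ratio_k = a_k / b_k  # the physically meaningful ratio, close to 1

print(diff_c, round(diff_k, 2))
print(ratio_c, round(ratio_k, 3))
```

The difference is the same on both scales, while the ratio changes once the zero point moves, which is why ratios are reserved for ratio-scaled attributes.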

Discrete vs. Continuous Attributes
• Discrete Attribute
• Has only a finite or countably infinite set of values
• E.g., zip codes, profession, or the set of words in a collection of
documents
• Sometimes, represented as integer variables
• Note: Binary attributes are a special case of discrete
attributes
• Continuous Attribute
• Has real numbers as attribute values
• E.g., temperature, height, or weight
• Practically, real values can only be measured and
represented using a finite number of digits
• Continuous attributes are typically represented as
floating-point variables

Quiz

• Identify the type of attribute for each of the following:
• Employee Designation
• Pizza_Taste
• Calendar_Dates
• Street numbers
• Eye color

Answer

• Employee Designation: Ordinal
• Pizza_Taste: Ordinal
• Calendar_Dates: Interval-scaled
• Street numbers: Ordinal
• Eye color: Nominal

Similarity and Dissimilarity
• Similarity
• Numerical measure of how alike two data objects are
• Value is higher when objects are more alike
• Often falls in the range [0,1]
• Dissimilarity (e.g., distance)
• Numerical measure of how different two data objects are
• Lower when objects are more alike
• Minimum dissimilarity is often 0
• Upper limit varies
• Proximity refers to a similarity or dissimilarity

Data Matrix and Dissimilarity Matrix
• Data matrix
• n data points with p dimensions
• Two modes

$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots & & \vdots & & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots & & \vdots & & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$

• Dissimilarity matrix
• n data points, but registers only the distance
• A triangular matrix
• Single mode

$$
\begin{bmatrix}
0 \\
d(2,1) & 0 \\
d(3,1) & d(3,2) & 0 \\
\vdots & \vdots & \vdots & \ddots \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$

Proximity Measure for Nominal Attributes

• Can take 2 or more states, e.g., red, yellow, blue, green (generalization of a binary attribute)
• Method 1: Simple matching
• m: # of matches, p: total # of variables

$$
d(i, j) = \frac{p - m}{p}
$$

• Method 2: Use a large number of binary attributes
• creating a new binary attribute for each of the M nominal states
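Both methods can be sketched in a few lines. The first function applies simple matching, d(i, j) = (p − m) / p; the second creates one binary attribute per nominal state. The sample objects and function names are hypothetical.

```python
def simple_matching_dissimilarity(obj_i, obj_j):
    """d(i, j) = (p - m) / p for two objects described by nominal attributes."""
    p = len(obj_i)                                      # total number of variables
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)  # number of matches
    return (p - m) / p


def one_hot(value, states):
    """Method 2: one binary attribute for each of the M nominal states."""
    return [1 if value == state else 0 for state in states]


# Hypothetical objects with three nominal attributes each.
obj_a = ["red", "single", "engineer"]
obj_b = ["red", "married", "engineer"]
d = simple_matching_dissimilarity(obj_a, obj_b)
encoded = one_hot("yellow", ["red", "yellow", "blue", "green"])

print(round(d, 2))
print(encoded)
```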

Proximity Measure for Binary Attributes
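• A contingency table for binary data

                     Object j
                     1        0        sum
Object i      1      q        r        q + r
              0      s        t        s + t
              sum    q + s    r + t    p

• Distance measure for symmetric binary variables:

$$
d(i, j) = \frac{r + s}{q + r + s + t}
$$

• Distance measure for asymmetric binary variables:

$$
d(i, j) = \frac{r + s}{q + r + s}
$$

• Jaccard coefficient (similarity measure for asymmetric binary variables):

$$
sim_{Jaccard}(i, j) = \frac{q}{q + r + s}
$$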

Dissimilarity between Binary Variables
• Example

Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack  M       Y      N      P       N       N       N
Mary  F       Y      N      P       N       P       N
Jim   M       Y      P      N       N       N       N

• Gender is a symmetric attribute
• The remaining attributes are asymmetric binary
• Let the values Y and P be 1, and the value N be 0

$$
d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33
$$

$$
d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67
$$

$$
d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75
$$

Standardizing Numeric Data

• Z-score:

$$
z = \frac{x - \mu}{\sigma}
$$

• x: raw score to be standardized, μ: mean of the population, σ: standard deviation
• the distance between the raw score and the population mean in units of the standard deviation
• negative when the raw score is below the mean, positive when above
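A minimal sketch of z-score standardization, z = (x − μ) / σ, using Python's standard `statistics` module. The sample scores are made up, and the sample is treated as the whole population, so `pstdev` (population standard deviation) is used.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize each value: z = (x - mu) / sigma."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    return [(x - mu) / sigma for x in values]


# Hypothetical raw scores with mean 5 and standard deviation 2.
scores = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
standardized = z_scores(scores)
print(standardized)
```

Scores below the mean of 5 come out negative and scores above it positive, matching the last bullet.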

Ordinal Variables

• An ordinal variable can be discrete or continuous
• Order is important, e.g., rank
• Can be treated like interval-scaled
• replace $x_{if}$ by their rank $r_{if} \in \{1, \ldots, M_f\}$
• map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

$$
z_{if} = \frac{r_{if} - 1}{M_f - 1}
$$

• compute the dissimilarity using methods for interval-scaled variables
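The rank mapping can be sketched as follows. The states and sample values are hypothetical, and the absolute difference in the last step stands in for whichever interval-scaled distance is actually chosen.

```python
def normalize_ordinal(values, ordered_states):
    """Map ordinal values onto [0, 1] via z_if = (r_if - 1) / (M_f - 1)."""
    m_f = len(ordered_states)
    # Rank r_if in {1, ..., M_f}, following the given order of states.
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    return [(rank[v] - 1) / (m_f - 1) for v in values]


# Hypothetical ordinal attribute with M_f = 3 states.
states = ["small", "medium", "large"]
sizes = ["small", "large", "medium"]
z = normalize_ordinal(sizes, states)

# Assumed interval-scaled dissimilarity: absolute difference of z values.
d_small_large = abs(z[0] - z[1])
print(z, d_small_large)
```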
Summary

• Attributes
• Attribute types
• Similarity and Dissimilarity


Thank you
