III-IT-Data Mining Unit 1-Session 3
Unit 1-Session 3
CO1: Identify the types of data to be pre-processed for
the given dataset using the preprocessing
technique.
LO1.1: Describe data mining and its
functionalities
SO1.1.6: List the different types of attributes with
examples.
Data Mining
Unit I – INTRODUCTION
• Introduction – Different Kinds of Data
• Patterns Mined – Applications
• Attribute Types
• Data Preprocessing: Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation
• Data Discretization
• Data Visualization
Types of Data Sets
• Record
• Relational records
• Data matrix, e.g., numerical matrix, crosstabs
• Document data: text documents represented as term-frequency vectors
[Figure: term-frequency table with columns team, coach, play, ball, score, game, win, lost, timeout, season; row "Document 1": 3 0 5 0 2 6 0 2 0 2]
• Transaction data
• Graph and network
• World Wide Web
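The term-frequency representation of document data listed above can be sketched in a few lines of Python. The vocabulary and the sample sentence are hypothetical and chosen only for illustration; they are not the data behind the "Document 1" row on the slide.

```python
from collections import Counter


def term_frequency_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())
    # Counter returns 0 for terms that never occur, so no key check is needed.
    return [counts[term] for term in vocabulary]


# Hypothetical vocabulary and document (assumptions, not from the slides).
vocabulary = ["team", "coach", "play", "ball", "score", "game"]
document = "The team lost the game but the team will play the next game"

vector = term_frequency_vector(document, vocabulary)
print(dict(zip(vocabulary, vector)))
```

Each document becomes one row of a data matrix whose columns are the vocabulary terms.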
Data Objects
• Data sets are made up of data objects.
• A data object represents an entity.
• Examples:
• sales database: customers, store items, sales
• medical database: patients, treatments
• university database: students, professors, courses
• Also called samples, examples, instances, data points,
objects, tuples.
• Data objects are described by attributes.
• Database rows -> data objects; columns -> attributes.
Attributes
• Attribute (also called dimension, feature, or variable): a
data field representing a characteristic or feature of a
data object.
• E.g., customer_ID, name, address
• Types:
• Nominal
• Ordinal
• Binary
• Numeric: quantitative
• Interval-scaled
• Ratio-scaled
Attribute Types
• Nominal: categories, states, or “names of things”
• Hair_color = {auburn, black, blond, brown, grey, red, white}
• marital status, occupation, ID numbers, zip codes
• Binary
• Nominal attribute with only 2 states (0 and 1)
• Symmetric binary: both outcomes equally important
• e.g., gender
• Asymmetric binary: outcomes not equally important.
• e.g., medical test (positive vs. negative)
• Convention: assign 1 to most important outcome (e.g., HIV
positive)
• Ordinal
• Values have a meaningful order (ranking), but the magnitude of the
difference between successive values is not known.
• Size = {small, medium, large}, grades, army rankings
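The three attribute types above differ in which operations make sense on their values. The following is a minimal sketch; the rank encoding `SIZE_RANK`, the `TEST_RESULT` mapping, and the helper `is_larger` are illustrative assumptions, not part of the slides.

```python
# Nominal: values are labels, so only equality comparison is meaningful.
hair_a, hair_b = "black", "brown"
same_hair = hair_a == hair_b

# Asymmetric binary: by convention the more important outcome is coded as 1.
TEST_RESULT = {"positive": 1, "negative": 0}

# Ordinal: values have an order, expressed here as ranks (an assumed encoding).
SIZE_RANK = {"small": 1, "medium": 2, "large": 3}


def is_larger(size_a, size_b):
    """Compare two ordinal sizes by rank.

    Only the order is meaningful; the rank difference is not a magnitude.
    """
    return SIZE_RANK[size_a] > SIZE_RANK[size_b]


print(same_hair)
print(TEST_RESULT["positive"])
print(is_larger("large", "medium"))
```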
Numeric Attribute Types
• Quantity (integer or real-valued)
• Interval
• Measured on a scale of equal-sized units
• Values have order
• E.g., temperature in °C or °F, calendar dates
• No true zero-point
• Ratio
• Inherent zero-point
• We can speak of a value as being a multiple (or ratio) of
another value (10 K is twice as high as 5 K).
• E.g., temperature in Kelvin, length, counts,
monetary quantities
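The difference between interval and ratio scales can be checked numerically. The sketch below uses the standard Celsius-to-Kelvin offset of 273.15; the two readings are arbitrary illustrative values.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius reading to Kelvin (offset of 273.15)."""
    return celsius + 273.15


c_low, c_high = 5.0, 10.0
k_low, k_high = celsius_to_kelvin(c_low), celsius_to_kelvin(c_high)

# Differences are meaningful on both scales and agree with each other.
diff_c = c_high - c_low
diff_k = k_high - k_low

# Ratios are only meaningful on the ratio scale (Kelvin, true zero point).
naive_ratio = c_high / c_low    # 2.0, but 10 °C is not "twice as hot" as 5 °C
kelvin_ratio = k_high / k_low   # about 1.018, the physically meaningful ratio

print(diff_c, diff_k)
print(naive_ratio, kelvin_ratio)
```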
Discrete vs. Continuous Attributes
• Discrete Attribute
• Has only a finite or countably infinite set of values
• E.g., zip codes, profession, or the set of words in a collection of
documents
• Sometimes, represented as integer variables
• Note: Binary attributes are a special case of discrete
attributes
• Continuous Attribute
• Has real numbers as attribute values
• E.g., temperature, height, or weight
• Practically, real values can only be measured and
represented using a finite number of digits
• Continuous attributes are typically represented as
floating-point variables
Quiz
Answer
• Employee designation: Ordinal
• Pizza_Taste: Ordinal
• Calendar_Dates: Interval-scaled
• Street numbers: Ordinal
• Eye color: Nominal
Similarity and Dissimilarity
• Similarity
• Numerical measure of how alike two data objects are
• Value is higher when objects are more alike
• Often falls in the range [0,1]
• Dissimilarity (e.g., distance)
• Numerical measure of how different two data objects are
• Lower when objects are more alike
• Minimum dissimilarity is often 0
• Upper limit varies
• Proximity refers to a similarity or dissimilarity
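Data Matrix and Dissimilarity Matrix
• Data matrix
• n data points with p dimensions
• Two modes
$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots & & \vdots & & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots & & \vdots & & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$
• Dissimilarity matrix
• n data points, but registers only the distance
• A triangular matrix
• Single mode
$$
\begin{bmatrix}
0 & & & & \\
d(2,1) & 0 & & & \\
d(3,1) & d(3,2) & 0 & & \\
\vdots & \vdots & \vdots & \ddots & \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$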
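The step from a data matrix to a dissimilarity matrix can be sketched as follows. Euclidean distance is used here as an assumed choice of d(i, j), since the slide does not fix a particular distance; the three data points are made up for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dissimilarity_matrix(data, distance=euclidean):
    """Return the lower-triangular dissimilarity matrix as a list of rows.

    Row i holds d(i, 0), ..., d(i, i), so the diagonal entries are 0.
    """
    n = len(data)
    return [[distance(data[i], data[j]) for j in range(i + 1)] for i in range(n)]


# Hypothetical data matrix: n = 3 objects, p = 2 attributes.
data = [[1.0, 2.0], [4.0, 6.0], [1.0, 6.0]]

matrix = dissimilarity_matrix(data)
for row in matrix:
    print(row)
```

Only the lower triangle is stored because d(i, j) = d(j, i) and d(i, i) = 0.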
Proximity Measure for Nominal Attributes
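One common proximity measure for nominal attributes, described in Han, Kamber, and Pei, is the simple matching ratio d(i, j) = (p - m) / p, where p is the total number of attributes and m is the number of attributes on which the two objects match. A minimal sketch, with hypothetical objects and attribute values:

```python
def nominal_dissimilarity(obj_i, obj_j):
    """Simple matching dissimilarity d(i, j) = (p - m) / p for nominal attributes."""
    if len(obj_i) != len(obj_j):
        raise ValueError("objects must have the same number of attributes")
    p = len(obj_i)                                        # total attributes
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)    # matching attributes
    return (p - m) / p


# Hypothetical objects described by (hair_color, marital_status, occupation).
person_a = ("black", "single", "engineer")
person_b = ("black", "married", "engineer")

d_ab = nominal_dissimilarity(person_a, person_b)
print(round(d_ab, 2))
```

The corresponding similarity can be taken as 1 - d(i, j) = m / p.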
Proximity Measure for Binary Attributes
[Figure: contingency table for binary data, comparing Object i with Object j]
Dissimilarity between Binary Variables
• Example
Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   M        Y       N       P        N        N        N
Mary   F        Y       N       P        N        P        N
Jim    M        Y       P       N        N        N        N
• Gender is a symmetric attribute
• The remaining attributes are asymmetric binary
• Let the values Y and P be 1, and the value N be 0
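$$d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33$$
$$d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67$$
$$d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75$$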
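The three distances above can be reproduced with a short script. It assumes the asymmetric binary dissimilarity d(i, j) = (r + s) / (q + r + s) from Han, Kamber, and Pei, where q counts attributes on which both objects are 1, r those on which only object i is 1, and s those on which only object j is 1. Gender is left out because it is symmetric. The function name and the string encoding of the table rows are illustrative choices.

```python
def asymmetric_binary_dissimilarity(obj_i, obj_j):
    """d(i, j) = (r + s) / (q + r + s); negative matches (0, 0) are ignored."""
    q = sum(1 for a, b in zip(obj_i, obj_j) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(obj_i, obj_j) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(obj_i, obj_j) if a == 0 and b == 1)
    if q + r + s == 0:
        return 0.0  # assumption: objects with no positive value count as identical
    return (r + s) / (q + r + s)


ENCODE = {"Y": 1, "P": 1, "N": 0}

# Fever, Cough, Test-1 .. Test-4 from the example table (Gender excluded).
patients = {"jack": "YNPNNN", "mary": "YNPNPN", "jim": "YPNNNN"}
vectors = {name: [ENCODE[c] for c in row] for name, row in patients.items()}

d_jack_mary = asymmetric_binary_dissimilarity(vectors["jack"], vectors["mary"])
d_jack_jim = asymmetric_binary_dissimilarity(vectors["jack"], vectors["jim"])
d_jim_mary = asymmetric_binary_dissimilarity(vectors["jim"], vectors["mary"])

print(round(d_jack_mary, 2), round(d_jack_jim, 2), round(d_jim_mary, 2))
```

Under this measure Jim and Mary are the least alike, while Jack and Mary are the most alike.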
Standardizing Numeric Data
• Z-score: $z = \frac{x - \mu}{\sigma}$
• x: raw score to be standardized, μ: mean of the population, σ:
standard deviation
• the distance between the raw score and the population mean in units
of the standard deviation
• negative when the raw score is below the mean, positive when above
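A minimal sketch of z-score standardization, z = (x - μ) / σ, using Python's standard `statistics` module. The sample values are made up, and the population standard deviation (`pstdev`) is assumed, matching the "mean of the population" wording above.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize values to z = (x - mu) / sigma using population statistics."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("standard deviation is zero; z-scores are undefined")
    return [(x - mu) / sigma for x in values]


# Hypothetical raw scores with mean 5 and population standard deviation 2.
raw = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

standardized = z_scores(raw)
print(standardized)
```

Scores below the mean come out negative and scores above it positive, as stated in the last bullet.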
Ordinal Variables
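One common treatment of ordinal attributes, described in Han, Kamber, and Pei, replaces each value by its rank r in {1, ..., M} and maps the rank onto [0, 1] with z = (r - 1) / (M - 1), after which the values can be handled like numeric data. A minimal sketch, assuming the ordered state list `SIZES` from the earlier Size example:

```python
def ordinal_to_unit_interval(value, ordered_states):
    """Map an ordinal value to [0, 1] via z = (r - 1) / (M - 1)."""
    m = len(ordered_states)
    if m < 2:
        raise ValueError("need at least two ordered states")
    rank = ordered_states.index(value) + 1  # ranks start at 1
    return (rank - 1) / (m - 1)


# Ordered states, lowest first (illustrative).
SIZES = ["small", "medium", "large"]

scaled = [ordinal_to_unit_interval(s, SIZES) for s in SIZES]
print(scaled)
```

Mapping every ordinal attribute onto [0, 1] keeps attributes with many states from dominating those with few.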
Summary
• Attributes
• Attribute types
• Similarity and dissimilarity
Reference
1. Jiawei Han, Micheline Kamber, Jian Pei, “Data Mining:
Concepts and Techniques”, 3rd Edition, Elsevier, 2014.
2. Jure Leskovec, Anand Rajaraman, Jeffrey David
Ullman, “Mining of Massive Datasets”, 2nd Edition,
Cambridge University Press, 2014.
3. Ian H. Witten, Eibe Frank, Mark A. Hall, “Data Mining:
Practical Machine Learning Tools and Techniques”, 3rd
Edition, Elsevier, 2011.
Thank you