L - 2 - Data Scale
Asim Tewari, IIT Bombay ME 781: Statistical Machine Learning and Data Mining
Data Type
• Discrete data:
– Discrete non-ordered numbers
– Random collection of words
– Unrelated audio sounds
– Random music notes
• Sequential (temporal) data:
– Stochastic process
– Sequence of words in a sentence
– Audio speech data
– Music
• Spatial data:
– Image data
– Geo-spatial data
(Sequential and spatial data: spatio-temporal data)
Other classifications include
• Categorical vs numerical
• Qualitative vs quantitative
Data Scales
• Same numerical data may have different semantic meanings
• Based on semantic meanings there are four different scales
• For each scale level the operations and statistics of the lower
scale levels are also valid
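The cumulative rule above can be sketched in a few lines. This is only an illustration: the scale names and representative statistics are the ones used on the slides that follow, while the dictionaries and the helper functions `valid_operations` and `valid_statistics` are hypothetical names introduced here.

```python
# Scales ordered from the lowest to the highest level.
SCALES = ["nominal", "ordinal", "interval", "ratio"]

# Operations and representative statistic that each level adds (per the slides).
OPERATIONS = {
    "nominal": ["==", "!="],
    "ordinal": ["<", ">", "<=", ">="],
    "interval": ["+", "-"],
    "ratio": ["*", "/"],
}
STATISTIC = {
    "nominal": "mode",
    "ordinal": "median",
    "interval": "arithmetic mean",
    "ratio": "generalized mean",
}


def valid_operations(scale):
    """Operations valid at `scale`: its own plus those of all lower levels."""
    level = SCALES.index(scale)
    ops = []
    for lower in SCALES[: level + 1]:
        ops.extend(OPERATIONS[lower])
    return ops


def valid_statistics(scale):
    """Statistics valid at `scale`: its own plus those of all lower levels."""
    level = SCALES.index(scale)
    return [STATISTIC[lower] for lower in SCALES[: level + 1]]


print(valid_statistics("interval"))
print(valid_operations("ordinal"))
```

For example, interval data admit the mode and the median as well as the arithmetic mean, but not multiplication or division.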
• Nominal scaled data
– Only tests for equality or non-equality are valid.
– Data of a nominal feature can be represented by the mode (value
that occurs most frequently.)
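A minimal sketch of working with a nominal feature, using only equality tests and the mode. The feature values below are made up for illustration and are not taken from any of the course data sets.

```python
from collections import Counter

# Hypothetical nominal feature: labels with no inherent order.
body_style = ["sedan", "hatchback", "sedan", "wagon", "sedan", "hatchback"]

# Only equality and non-equality tests are meaningful.
same = body_style[0] == body_style[2]
different = body_style[0] != body_style[1]

# The mode is the value that occurs most frequently.
mode_value, mode_count = Counter(body_style).most_common(1)[0]
print(mode_value, mode_count)
```

Sorting these labels or averaging any numeric codes assigned to them would not be meaningful on this scale.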
• Ordinal scaled data
– In addition to equality and inequality, the operations “greater than” (>) and “less than” (<) are valid, as are the combinations “greater than or equal” (≥) and “less than or equal” (≤).
– The relation “less than or equal” (≤) defines a total order, such that for any x, y, z we have
• Antisymmetry: x ≤ y and y ≤ x imply x = y
• Transitivity: x ≤ y and y ≤ z imply x ≤ z
• Totality: x ≤ y or y ≤ x
– Represented by the median (the value for which (almost) as many smaller as larger values exist)
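The following sketch checks the three total-order properties on a small ordinal feature and picks its median. The ordering of the levels is an assumption made for illustration (a subset of education labels ranked by hand), and `leq` is a hypothetical helper.

```python
from itertools import product

# Assumed order of a few education levels, lowest first.
levels = ["Preschool", "HS-grad", "Bachelors", "Masters", "Doctorate"]
rank = {level: i for i, level in enumerate(levels)}


def leq(a, b):
    """The ordinal relation 'less than or equal' defined through ranks."""
    return rank[a] <= rank[b]


# Check the total-order properties over all combinations of levels.
antisymmetric = all(
    x == y for x, y in product(levels, repeat=2) if leq(x, y) and leq(y, x)
)
transitive = all(
    leq(x, z) for x, y, z in product(levels, repeat=3) if leq(x, y) and leq(y, z)
)
total = all(leq(x, y) or leq(y, x) for x, y in product(levels, repeat=2))

# Median of an odd-sized sample: the middle element after sorting by rank.
observed = ["Bachelors", "HS-grad", "Masters", "HS-grad", "Doctorate"]
ordered = sorted(observed, key=lambda level: rank[level])
median_value = ordered[len(ordered) // 2]
print(antisymmetric, transitive, total, median_value)
```

For an even number of observations one of the two middle elements would be reported, since averaging them requires addition, which is not valid on the ordinal scale.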
• Interval scaled data
– addition and subtraction are valid
– have arbitrary zero points
– represented by the (arithmetic) mean
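Temperature in degrees Celsius or Fahrenheit is a common example of interval-scaled data, since each unit has its own arbitrary zero point. The sketch below, with made-up readings, suggests why differences and the mean are meaningful on this scale while ratios are not.

```python
from statistics import mean

# Hypothetical temperature readings in degrees Celsius.
celsius = [10.0, 20.0, 30.0]


def to_fahrenheit(c):
    """Change of unit: a different scale factor and a different zero point."""
    return c * 9.0 / 5.0 + 32.0


fahrenheit = [to_fahrenheit(c) for c in celsius]

# The mean is consistent across units: converting the mean gives the mean
# of the converted values.
mean_c = mean(celsius)
mean_f = mean(fahrenheit)

# Ratios of differences are preserved by the change of unit ...
diff_ratio_c = (celsius[2] - celsius[1]) / (celsius[1] - celsius[0])
diff_ratio_f = (fahrenheit[2] - fahrenheit[1]) / (fahrenheit[1] - fahrenheit[0])

# ... but ratios of the values themselves are not, because zero is arbitrary.
ratio_c = celsius[1] / celsius[0]
ratio_f = fahrenheit[1] / fahrenheit[0]
print(mean_c, mean_f, ratio_c, ratio_f)
```

Here 20 °C is not "twice as warm" as 10 °C in any unit-independent sense, which is why multiplication and division are reserved for the ratio scale.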
• Ratio scaled data
– multiplication and division are valid
– represented by the generalized mean
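The slide does not define the generalized mean. The sketch below assumes it refers to the power mean, M_alpha = ((1/n) Σ x_i^alpha)^(1/alpha), with the geometric mean as the limiting case alpha → 0. The function name, the parameter name `alpha`, and the sample values are illustrative.

```python
import math


def generalized_mean(values, alpha):
    """Power mean with exponent alpha, for strictly positive values."""
    n = len(values)
    if alpha == 0:
        # Limit alpha -> 0: the geometric mean.
        return math.exp(sum(math.log(v) for v in values) / n)
    return (sum(v ** alpha for v in values) / n) ** (1.0 / alpha)


# Hypothetical ratio-scaled measurements (e.g. lengths, with a true zero).
lengths = [1.0, 2.0, 4.0]

arithmetic = generalized_mean(lengths, 1)   # alpha = 1
geometric = generalized_mean(lengths, 0)    # alpha -> 0
harmonic = generalized_mean(lengths, -1)    # alpha = -1
print(harmonic, geometric, arithmetic)
```

The geometric and harmonic means rely on multiplication and division, which is why they are only meaningful for ratio-scaled data.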
Data Type, Data Scale, Data value
Data Type, Data Scale and Data value are three different concepts
• Data Type (each type can be of any Data Scale):
– Discrete Type
• Order of collection does not matter
– Sequential Type
• One-directional order of collection
– Spatio-temporal Type
• Multidimensional order of collection
• Data Scale
– Ratio -> Can be only numerical (also called quantitative)
– Interval -> Can be only numerical (also called quantitative)
– Ordinal -> Can be categorical or qualitative
– Nominal -> Can be only categorical
• Data value
– Discrete (numerical or non-numerical)
– Continuous (numerical, also called quantitative)
1985 Auto Imports Database
Abalone (sea snails) data
Census bureau database
age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate,
5th-6th, Preschool.
education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct,
Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Female, Male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South,
China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos,
Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong,
Holand-Netherlands.
39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K
50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K
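A sample record like the ones above can be parsed into named fields as sketched below. Each record has 15 comma-separated fields, one more attribute than the listing plus the class label. The sketch assumes the third field is the `fnlwgt` (sampling weight) attribute and the last field is the income label, as in the public UCI Adult data set; `parse_record` is a hypothetical helper.

```python
# Column names in record order. "fnlwgt" and "income" do not appear in the
# attribute listing above and are assumed from the UCI Adult data set.
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "income",
]

# Attributes marked "continuous" in the listing (stored as integers here).
NUMERIC = {
    "age", "fnlwgt", "education-num",
    "capital-gain", "capital-loss", "hours-per-week",
}


def parse_record(line):
    """Split one comma-separated record into a dict keyed by column name."""
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    return {
        name: int(value) if name in NUMERIC else value
        for name, value in zip(COLUMNS, fields)
    }


record = parse_record(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K"
)
print(record["age"], record["education"], record["income"])
```

Note that `education` (ordinal labels) and `education-num` (a numeric code) describe the same information on different representations.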
Variables in ML
• The inputs go by different names, such as
predictors, independent variables, features, or
sometimes just variables, and are typically
denoted using the symbol X
• The output variable is often called the
response or dependent variable, and is
typically denoted using the symbol Y
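As a small illustration of this notation, the sketch below separates two records into inputs X and an output Y. The values are taken from the two census sample records shown earlier; which columns serve as features and which as the response is a choice made here for illustration only.

```python
# Two records reduced to a few fields (values from the census sample rows).
rows = [
    {"age": 39, "education-num": 13, "hours-per-week": 40, "income": "<=50K"},
    {"age": 50, "education-num": 13, "hours-per-week": 13, "income": "<=50K"},
]

# Assumed choice of predictors (features) and response.
feature_names = ["age", "education-num", "hours-per-week"]
response_name = "income"

# X holds one feature vector per record, Y the corresponding responses.
X = [[row[name] for name in feature_names] for row in rows]
Y = [row[response_name] for row in rows]
print(X)
print(Y)
```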
Supervised Machine Learning
Regression vs Classification
Data Set vs Matrix Representations
We can denote numerical feature data as a set
$X = \{x_1, x_2, \ldots, x_n\} \subset \mathbb{R}^p$
• with n elements, where
• each element is a p-dimensional real-valued
feature vector, and n and p are positive
integers. For p = 1 we call X a scalar data set.
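A minimal sketch of the set representation, using a list of tuples to hold the feature vectors. The numbers are illustrative and `dimensions` is a hypothetical helper; a list is used instead of a Python `set` so that the elements keep their indices.

```python
def dimensions(data):
    """Return (n, p) for a collection of equally long feature vectors."""
    n = len(data)
    p = len(data[0])
    if any(len(x) != p for x in data):
        raise ValueError("all feature vectors must have the same dimension p")
    return n, p


# n = 3 elements, each a p = 2 dimensional feature vector.
X = [(39.0, 40.0), (50.0, 13.0), (38.0, 40.0)]
n, p = dimensions(X)

# For p = 1 the data set is a scalar data set.
X_scalar = [(39.0,), (50.0,), (38.0,)]
n_scalar, p_scalar = dimensions(X_scalar)
print(n, p, n_scalar, p_scalar)
```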
Data Set and Matrix Representations
• As an alternative to the set representation, numerical feature data
are also often represented as a matrix
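The matrix itself is not reproduced here, so the layout in the sketch below is an assumption: one row per element and one column per feature (an n × p matrix), with the transposed p × n layout shown as the alternative. Nested lists are used to keep the example free of external libraries.

```python
# Feature vectors x_1, ..., x_n (illustrative values, n = 3, p = 2).
feature_vectors = [(39.0, 40.0), (50.0, 13.0), (38.0, 40.0)]

# Assumed layout: row k holds element x_k, column i holds feature i (n x p).
matrix_n_by_p = [list(x) for x in feature_vectors]

# Alternative layout: one column per element (p x n), obtained by transposing.
matrix_p_by_n = [list(column) for column in zip(*matrix_n_by_p)]

# Feature i of element k is matrix_n_by_p[k][i], or matrix_p_by_n[i][k].
k, i = 1, 0
value = matrix_n_by_p[k][i]
print(matrix_n_by_p)
print(matrix_p_by_n)
```

Whichever layout is chosen, it should be stated explicitly, since many libraries expect a specific orientation.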