02 DataCategorization
(CS61061)
Lecture #2
Data Categorization
NOIR topology
Nominal scale
Binary
Symmetric
Asymmetric
Ordinal scale
N: Nominal
O: Ordinal
I: Interval
R: Ratio
[Figure: NOIR classification tree. Labels: Binary (Symmetric, Asymmetric), Ternary, Others; Ordered (Alphabetically, Numerically, Literally); Discrete, Continuous.]
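1. Distinctiveness: = and ≠
2. Order: <, ≤, >, ≥
3. Addition: + and -
4. Multiplication: * and /
Categorical (Qualitative): Nominal (property 1), Ordinal (properties 1 and 2)
Numerical (Quantitative): Interval (properties 1 to 3), Ratio (properties 1 to 4)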
@DSamanta, IIT Kharagpur Data Analytics (CS61061) 9
NOIR summary
Nominal (with distinctiveness property only)
Examples
Gender: coded using letters or numbers, e.g., {M, F} or {1, 0}
Country code
Labels (from two different attributes) can be combined to give another nominal
variable.
For example, blood group combined with Rh factor (A+, A-, AB+, etc.).
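The blood-group example can be sketched in a few lines of Python. This is only an illustration: the variable names are hypothetical, and the full label sets (A, B, AB, O and +, -) are filled in from general knowledge rather than from the slide.

```python
from itertools import product

# Two nominal attributes (label sets assumed for illustration).
abo_groups = ["A", "B", "AB", "O"]
rh_factors = ["+", "-"]

# Every (ABO, Rh) pair becomes one label of a new nominal variable.
blood_types = [abo + rh for abo, rh in product(abo_groups, rh_factors)]

print(blood_types)
```

The combined variable is still nominal: its labels can be tested for equality, but they carry no order.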
Examples
Switch: {ON, OFF}
Attendance: {True, False}
Entry: {Yes, No}
etc.
Note
A binary variable is a special case of a nominal variable that takes only two
possible values.
The allowed operations are accessing (read, check, etc.) and re-coding (into
another non-overlapping symbol set, that is, a one-to-one mapping).
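A minimal sketch of re-coding, assuming the gender coding {M, F} to {1, 0} from the earlier example. The helper `is_one_to_one` is a hypothetical name introduced here to show what "non-overlapping" means in practice.

```python
# Assumed re-coding of a nominal variable: letters -> numbers.
recode = {"M": 1, "F": 0}

def is_one_to_one(mapping):
    """A re-coding is valid only if no two labels share a code."""
    return len(set(mapping.values())) == len(mapping)

genders = ["M", "F", "F", "M"]
coded = [recode[g] for g in genders]

# Because the mapping is one-to-one, it can be inverted without loss.
decode = {code: label for label, code in recode.items()}
restored = [decode[c] for c in coded]

print(coded, restored)
```

Note that the numeric codes are still only labels: checking `coded[0] == coded[3]` is meaningful, whereas adding or averaging the codes is not.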
Nominal data can be visualized using line charts, bar charts, pie charts, etc.
Note
The values assumed by an ordinal variable can be ordered
among themselves as each pair of values can be compared
literally or using relational operators ( < , ≤ , > , ≥ ).
Ordinal data can be ranked (numerically, alphabetically, etc.). Hence, we can
compute any percentile measure (e.g., the median) of ordinal data.
Calculations based on order are permitted (such as count, min, max, etc.).
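Order-based calculations can be sketched as follows. The rating scale and the sample values are made up for illustration; the only assumption that matters is that the labels have an agreed order.

```python
# Hypothetical ordinal scale, listed from lowest to highest.
LEVELS = ["poor", "fair", "good", "excellent"]
RANK = {label: i for i, label in enumerate(LEVELS)}

ratings = ["good", "poor", "excellent", "fair", "good"]

# Sort by position on the scale, not alphabetically.
ordered = sorted(ratings, key=RANK.__getitem__)

lowest = ordered[0]
highest = ordered[-1]
# Lower median: the middle value of the ordered list (50th percentile).
median = ordered[(len(ordered) - 1) // 2]

print(lowest, median, highest)
```

The mean is deliberately absent: it would require addition, which the ordinal scale does not support.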
A numerical variable can be transformed into an ordinal variable (and vice versa),
but with a loss of information.
For example, Age [1, …, 100] → {young, middle-aged, old}
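The age example can be sketched as a simple binning function. The cut-offs (30 and 60) are assumptions chosen for illustration; the slide does not specify them.

```python
def age_group(age):
    """Map a numerical age to an ordinal label (cut-offs are assumed)."""
    if age < 30:
        return "young"
    if age < 60:
        return "middle-aged"
    return "old"

ages = [5, 29, 30, 45, 60, 100]
groups = [age_group(a) for a in ages]

print(groups)
```

The loss of information is visible here: ages 5 and 29 both become "young", so the original values cannot be recovered from the labels.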
Note
Interval data have well-defined intervals: equal differences between values
represent equal differences in the measured characteristic.
Interval data are measured on a numeric scale (with positive, zero, and negative
values).
Interval data have a zero point (origin). However, the origin is arbitrary and does
not imply a true absence of the measured characteristic.
For example, for temperature in Celsius or Fahrenheit, 0° does not mean the absence
of temperature, that is, no heat!
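Other one-to-one non-linear transformations (e.g., log, exp) can also be applied,
provided the function is one-to-one over the range of the data (sin, for example,
is one-to-one only on a restricted interval).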
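Returning to the Celsius and Fahrenheit example, the short sketch below shows why ratios are not meaningful on an interval scale while differences are. The sample temperatures are arbitrary; the conversion F = C × 9/5 + 32 is the standard one.

```python
def c_to_f(celsius):
    """Standard Celsius-to-Fahrenheit conversion."""
    return celsius * 9 / 5 + 32

# Three arbitrary temperatures in Celsius and their Fahrenheit equivalents.
a_c, b_c, c_c = 10.0, 20.0, 40.0
a_f, b_f, c_f = c_to_f(a_c), c_to_f(b_c), c_to_f(c_c)  # 50.0, 68.0, 104.0

# Ratios of values depend on the unit, so "twice as hot" has no meaning.
ratio_c = b_c / a_c  # 2.0
ratio_f = b_f / a_f  # 1.36

# Ratios of differences do not depend on the unit.
diff_ratio_c = (c_c - b_c) / (b_c - a_c)  # 20 / 10 = 2.0
diff_ratio_f = (c_f - b_f) / (b_f - a_f)  # 36 / 18 = 2.0

print(ratio_c, ratio_f, diff_ratio_c, diff_ratio_f)
```

The same 20 °C reading is "twice" 10 °C in one unit and 1.36 times it in the other, which is exactly what an arbitrary zero point implies.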
Note
All ratio data are interval data but the reverse is not true.
On a ratio scale, both differences between data values and ratios of (non-zero)
data pairs are meaningful.
Both interval and ratio data can be stored in the same data types (e.g., integer,
float, double).
Example.
Rainfall data of the Meteorological Department
Time (Year, Season, Month, Week, Day, etc.)
Location (Country, Region, State, etc.)
2-D view of rainfall data
DRILL DOWN
Data cube segregation
BASE CUBOID
SLICE
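The drill-down and slice operations on the rainfall cube can be sketched in plain Python. Everything in this block is hypothetical: the records, the state names, and the helper `aggregate` are invented for illustration and are not data from the Meteorological Department.

```python
from collections import defaultdict

# Made-up rainfall records: (year, month, state, rainfall in mm).
# At this finest level of detail they play the role of the base cuboid.
records = [
    (2020, "Jun", "State A", 600),
    (2020, "Jul", "State A", 700),
    (2020, "Jun", "State B", 200),
    (2020, "Jul", "State B", 350),
    (2021, "Jun", "State A", 550),
    (2021, "Jun", "State B", 250),
]

def aggregate(rows, key):
    """Sum rainfall over all rows that share the same key."""
    totals = defaultdict(int)
    for year, month, state, mm in rows:
        totals[key(year, month, state)] += mm
    return dict(totals)

# Coarse 2-D view: total rainfall per (year, state).
by_year_state = aggregate(records, lambda y, m, s: (y, s))

# Drill down: move along the time hierarchy from year to (year, month).
by_month_state = aggregate(records, lambda y, m, s: (y, m, s))

# Slice: fix one dimension to a single value (here, year == 2020).
slice_2020 = [r for r in records if r[0] == 2020]
slice_2020_by_state = aggregate(slice_2020, lambda y, m, s: s)

print(by_year_state)
print(slice_2020_by_state)
```

Drilling down increases the number of cells (four at the year level, six at the month level), while slicing keeps the level of detail and reduces the cube by one dimension.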
Data representation
How can a document (e.g., text) be represented?