Lesson 2.1 - Know Your Data PDF
Anis ur Rahman
Department of Computing
NUST-SEECS
Islamabad
Getting To Know Your Data
Road Map
Ordered
Videos
Temporal data
Sequential data
Genetic sequence data
Attribute Types
Attribute Types
Discrete
Has only a finite or countably infinite set of values
e.g., zip codes, professions, or the set of words in a collection of
documents
Sometimes represented as integer variables
Note. Binary attributes are a special case of discrete attributes
Continuous
Has real numbers as attribute values
e.g., temperature, height, or weight
Practically, real values can only be measured and represented
using a finite number of digits
Typically represented as floating-point variables
Quiz
Road Map
Motivation
For data preprocessing, an overall picture of data is essential
Age      Frequency
1-5      200
6-15     450
16-20    300
21-50    1500
51-80    700
Mode
Value that occurs most frequently in the data
It is possible that several different values have the greatest
frequency
unimodal, bimodal, trimodal, multimodal
If each data value occurs only once then there is no mode
Empirical formula (for unimodal data that are moderately skewed):
mean − mode ≈ 3 × (mean − median)
Midrange
Can also be used to assess the central tendency
It is the average of the smallest and the largest value of the set
It is an algebraic measure that is easy to compute
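The mode and midrange definitions above can be sketched in a few lines of Python. The helper names `modes` and `midrange` and the sample values are made up for illustration and are not part of the lecture.

```python
from collections import Counter


def modes(values):
    """Return every value that attains the highest frequency (may be several)."""
    counts = Counter(values)
    top = max(counts.values())
    # If each value occurs only once, the data set has no mode.
    if top == 1:
        return []
    return sorted(v for v, c in counts.items() if c == top)


def midrange(values):
    """Average of the smallest and the largest value."""
    return (min(values) + max(values)) / 2


# Hypothetical sample, chosen only to illustrate a bimodal data set.
sample = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(modes(sample))     # two values share the top frequency: bimodal
print(midrange(sample))  # (30 + 110) / 2
```

Returning a list rather than a single value keeps the unimodal, bimodal, and no-mode cases in one interface.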
Quiz
Range
the distance between the largest and the smallest values
k-th percentile
value xi having the property that k% of the data lies at or below xi
the median is the 50th percentile
the most commonly used percentiles other than the median are the quartiles
Q1 (25th percentile), Q3 (75th percentile)
Quartiles + median give some indication of the center, spread, and the
shape of a distribution
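One way to read the percentile definition literally is the nearest-rank rule: sort the data and take the smallest value that has at least k% of the data at or below it. The sketch below assumes that rule; the function name and the sample are hypothetical.

```python
import math


def percentile_nearest_rank(values, k):
    """Smallest data value with at least k% of the data at or below it."""
    if not 0 < k <= 100:
        raise ValueError("k must be in the interval (0, 100]")
    ordered = sorted(values)
    # 1-based rank of the requested percentile under the nearest-rank rule.
    rank = math.ceil(k * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical sample, made up for illustration.
sample = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
q1 = percentile_nearest_rank(sample, 25)
median = percentile_nearest_rank(sample, 50)
q3 = percentile_nearest_rank(sample, 75)
print(q1, median, q3)
```

Other conventions interpolate between neighbouring values, so library routines may return slightly different quartiles for the same data.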
Inter-quartile range
Distance between the first and the third quartiles
IQR = Q3 − Q1
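As a minimal sketch, the IQR can be computed with Python's standard library. This assumes the default quartile rule of `statistics.quantiles`, which interpolates between data points; the helper name and sample are hypothetical.

```python
import statistics


def interquartile_range(values):
    """Return (Q1, Q3, IQR) using the standard library's default quartile rule."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return q1, q3, q3 - q1


# Hypothetical sample, made up for illustration.
sample = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
q1, q3, iqr = interquartile_range(sample)
print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
```

Because the IQR depends only on the middle half of the data, the extreme value 110 in the sample has little influence on it.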
Graphic Displays
Boxplot
Histogram Analysis
Note
If the attribute is nominal → bar chart
If the attribute is numeric → histogram
The two histograms shown on the left may have the same boxplot
representation
The same values for min, Q1, median, Q3, and max, but rather
different data distributions
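The point can be checked numerically. The sketch below uses two hypothetical data sets (not the ones plotted on the slide) and assumes the quartiles are taken as the medians of the lower and upper halves of the sorted data.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max); quartiles are medians of the halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For odd n the median itself is left out of both halves.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return (ordered[0], statistics.median(lower), statistics.median(ordered),
            statistics.median(upper), ordered[-1])


def histogram_counts(values, low, width, bins):
    """Count values per equal-width bin; the last bin includes its upper edge."""
    counts = [0] * bins
    for v in values:
        index = min(int((v - low) // width), bins - 1)
        counts[index] += 1
    return counts


# Hypothetical data: a is clustered around 2 and 8, b is spread evenly.
a = [0, 2, 2, 2, 5, 8, 8, 8, 10]
b = [0, 1, 3, 4, 5, 6, 7, 9, 10]

print(five_number_summary(a) == five_number_summary(b))  # identical boxplots
print(histogram_counts(a, low=0, width=2, bins=5))
print(histogram_counts(b, low=0, width=2, bins=5))
```

A boxplot drawn from either data set would look the same, while the bin counts reveal the clustering in `a`.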
Scatter plot
Road Map
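Data matrix
n data points with p dimensions
Two modes: rows and columns represent different entities
$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$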
Dissimilarity matrix
Nominal Attributes
Binary Attributes
Example
Numeric Attributes
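$m_f$: the mean of $f$
$x_{1f}, \dots, x_{nf}$: measurements of $f$
Calculate the standardized measurement, or z-score:
$z_{if} = \frac{x_{if} - m_f}{s_f}$
where $s_f = \frac{1}{n}\left(|x_{1f} - m_f| + \cdots + |x_{nf} - m_f|\right)$ is the mean absolute deviation of $f$
Using the mean absolute deviation instead of the standard deviation reduces the effect of outliers
Outliers remain detectable because the deviations are not squared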
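The standardization step can be sketched as follows, assuming s_f is the mean absolute deviation of the attribute. The function name and the sample measurements are hypothetical.

```python
def standardize(values):
    """z-scores computed with the mean absolute deviation as the scale s_f."""
    n = len(values)
    mean = sum(values) / n                           # m_f
    mad = sum(abs(x - mean) for x in values) / n     # s_f
    if mad == 0:
        raise ValueError("all measurements are equal; z-scores are undefined")
    return [(x - mean) / mad for x in values]


# Hypothetical measurements of one attribute; 14 plays the role of an outlier.
measurements = [2, 4, 4, 6, 14]
z = standardize(measurements)
print([round(value, 3) for value in z])
```

The outlier still receives the largest z-score, so it stays visible after standardization.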
Dissimilarity Matrices
Manhattan (L1):
      x1    x2    x3    x4
x1    0
x2    5     0
x3    3     6     0
x4    6     1     7     0

Euclidean (L2):
      x1    x2    x3    x4
x1    0
x2    3.61  0
x3    2.24  5.1   0
x4    4.24  1     5.39  0
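A short sketch of how such lower-triangular dissimilarity matrices can be computed. The four 2-D points below are an assumption: they do not appear in the text and were chosen because they reproduce the distances in both matrices.

```python
import math


def manhattan(a, b):
    """L1 distance: sum of absolute coordinate differences."""
    return sum(abs(p - q) for p, q in zip(a, b))


def euclidean(a, b):
    """L2 distance: square root of the summed squared differences."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def dissimilarity_matrix(points, distance):
    """Lower-triangular matrix: row i holds d(i, j) for every j <= i."""
    return [[distance(points[i], points[j]) for j in range(i + 1)]
            for i in range(len(points))]


# Assumed coordinates for x1..x4 (not given in the text).
points = [(1, 2), (3, 5), (2, 0), (4, 5)]

l1 = dissimilarity_matrix(points, manhattan)
l2 = [[round(d, 2) for d in row]
      for row in dissimilarity_matrix(points, euclidean)]
for row in l1:
    print(row)
for row in l2:
    print(row)
```

Only the lower triangle is stored because d(i, j) = d(j, i) and d(i, i) = 0.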
Ordinal Variables
Vector Objects
$\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$
where $\cdot$ indicates the vector dot product, and $\|d\|$ is the length of vector $d$
Cosine Similarity
Example. Find the similarity between documents 1 and 2
$d_1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)$ and $d_2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)$
Compute $d_1 \cdot d_2$
$d_1 \cdot d_2 = (5)(3) + 0 + (3)(2) + 0 + (2)(1) + (0)(1) + (0)(0) + (2)(1) + 0 + (0)(1) = 25$
Compute $\|d_1\|$
$\|d_1\| = \sqrt{(5)(5) + 0 + (3)(3) + 0 + (2)(2) + 0 + 0 + (2)(2) + 0 + 0} = \sqrt{42} = 6.481$
Compute $\|d_2\|$
$\|d_2\| = \sqrt{(3)(3) + 0 + (2)(2) + 0 + (1)(1) + (1)(1) + 0 + (1)(1) + 0 + (1)(1)} = \sqrt{17} = 4.12$
Compute cosine
$\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|} = \frac{25}{6.481 \times 4.12} = 0.94$
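The worked example can be reproduced with a few lines of Python. This is a minimal sketch; the function name `cosine_similarity` is chosen here for illustration.

```python
import math


def cosine_similarity(d1, d2):
    """Dot product of d1 and d2 divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    if norm1 == 0 or norm2 == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm1 * norm2)


# Term-frequency vectors of documents 1 and 2 from the example.
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
similarity = cosine_similarity(d1, d2)
print(round(similarity, 2))  # 0.94
```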
Summary
Note
The steps above are the beginning of data preprocessing
Many methods have been developed, but this is still an active area of
research.