02 Data
— Chapter 2 —
Chapter 2: Getting to Know Your Data
- Data Visualization
- Summary
Types of Data Sets
- Record
  - Relational records
  - Data matrix, e.g., numerical matrix, crosstabs
  - Document data: text documents represented as term-frequency vectors
  - Transaction data
- Graph and network

Term-frequency vectors:

             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1   3     0      5     0     2      6     0    2     0        2
Document 2   0     7      0     2     1      0     0    3     0        0

Important characteristics of structured data:
- Dimensionality: curse of dimensionality
- Sparsity: only presence counts
- Resolution: patterns depend on the scale
- Distribution: centrality and dispersion
Data Objects
- Types:
  - Nominal
  - Binary
  - Numeric: quantitative
    - Interval-scaled
    - Ratio-scaled
Attribute Types
- Nominal: categories, states, or "names of things"
  - Hair_color = {auburn, black, blond, brown, grey, red, white}
  - Marital status, occupation, ID numbers, zip codes
- Binary
  - Nominal attribute with only 2 states (0 and 1)
  - Symmetric binary: both outcomes equally important
    - e.g., gender
  - Asymmetric binary: outcomes not equally important
    - e.g., medical test (positive vs. negative)
    - Convention: assign 1 to the most important outcome (e.g., HIV positive)
- Ordinal
  - Values have a meaningful order (ranking), but the magnitude between successive values is not known
  - Size = {small, medium, large}, grades, army rankings
Numeric Attribute Types
- Quantity (integer or real-valued)
- Interval
  - Measured on a scale of equal-sized units
  - Values have order
    - E.g., temperature in °C or °F, calendar dates
  - No true zero-point
- Ratio
  - Inherent zero-point
  - We can speak of one value as being a multiple of another (10 K is twice as high as 5 K)
  - E.g., temperature in Kelvin, length, counts, monetary quantities
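The difference between interval and ratio scales can be checked numerically. The sketch below is only an illustration: it reuses the 5 and 10 from the Kelvin example, and the helper name `celsius_to_kelvin` is made up for this note.

```python
# Illustrative sketch: ratios are meaningful only on a scale with a true zero-point.
KELVIN_OFFSET = 273.15

def celsius_to_kelvin(celsius: float) -> float:
    """Convert an interval-scaled Celsius reading to ratio-scaled Kelvin."""
    return celsius + KELVIN_OFFSET

# Ratio scale: 0 K is a true zero, so the quotient has a physical meaning.
kelvin_ratio = 10.0 / 5.0

# Interval scale: 10 °C is not "twice as hot" as 5 °C. Converting both
# readings to Kelvin gives the actual ratio of the two temperatures.
celsius_ratio_in_kelvin = celsius_to_kelvin(10.0) / celsius_to_kelvin(5.0)

# Differences are meaningful on both scales and agree with each other.
difference_celsius = 10.0 - 5.0
difference_kelvin = celsius_to_kelvin(10.0) - celsius_to_kelvin(5.0)

print(kelvin_ratio, round(celsius_ratio_in_kelvin, 3))
print(difference_celsius, round(difference_kelvin, 2))
```

The two Celsius readings differ by the same 5 units on either scale, but their true ratio is close to 1, not 2.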
Discrete vs. Continuous Attributes
- Discrete Attribute
  - Has only a finite or countably infinite set of values
  - E.g., zip codes, profession, or the set of words in a
Basic Statistical Descriptions of Data
- Motivation
  - To better understand the data: central tendency, variation, and spread
- Data dispersion characteristics
  - Median, max, min, quantiles, outliers, variance, etc.
- Numerical dimensions correspond to sorted intervals
  - Data dispersion: analyzed with multiple granularities of precision
  - Boxplot or quantile analysis on sorted intervals
- Dispersion analysis on computed measures
  - Folding measures into numerical dimensions
  - Boxplot or quantile analysis on the transformed cube
Measuring the Central Tendency
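- Mean (algebraic measure) (sample vs. population):
  $$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \mu = \frac{\sum x}{N}$$
  Note: n is sample size and N is population size.
  - Weighted arithmetic mean:
    $$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
  - Trimmed mean: chopping extreme values
- Median:
  - Middle value if odd number of values, or average of the middle two values otherwise
  - Estimated by interpolation (for grouped data):
    $$\text{median} = L_1 + \left(\frac{n/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}}\right) \times \text{width}$$
    where $L_1$ is the lower boundary of the median interval, $(\sum \text{freq})_l$ is the sum of the frequencies of all intervals below it, $\text{freq}_{\text{median}}$ is the frequency of the median interval, and width is the width of that interval.
- Standard deviation $s$ (or $\sigma$) is the square root of variance $s^2$ (or $\sigma^2$)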
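The measures of central tendency above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the `salaries` list, and the grouped intervals are made-up examples, and the trimmed mean is assumed to drop the same fraction from both ends.

```python
from statistics import mean, median

def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

def trimmed_mean(values, fraction):
    """Mean after dropping `fraction` (< 0.5) of the values at each end."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

def grouped_median(intervals):
    """Interpolated median for grouped data.

    `intervals` is a list of (lower, upper, freq) tuples sorted by lower bound.
    Implements median = L1 + ((n/2 - cumulative freq below) / freq_median) * width.
    """
    n = sum(freq for _, _, freq in intervals)
    below = 0  # sum of frequencies of intervals below the current one
    for lower, upper, freq in intervals:
        if below + freq >= n / 2:
            return lower + (n / 2 - below) / freq * (upper - lower)
        below += freq
    raise ValueError("no intervals given")

# Hypothetical sample, e.g. salaries in thousands.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(mean(salaries), median(salaries))
print(trimmed_mean(salaries, 0.1))
print(weighted_mean([1, 2, 3], [3, 1, 1]))
print(grouped_median([(0, 10, 2), (10, 20, 5), (20, 30, 3)]))
```

In this sample the single large value 110 pulls the mean above the median, and trimming one value from each end moves the mean back toward the median.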
Boxplot Analysis
Visualization of Data Dispersion: 3-D Boxplots
Graphic Displays of Basic Statistical Descriptions
Histogram Analysis
- Histogram: graph display of tabulated frequencies, shown as bars
- It shows what proportion of cases fall into each of several categories
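The tabulated frequencies behind a histogram can be computed directly. The sketch below assumes equal-width bins over a known range; the function name `histogram` and the sample values are illustrative only.

```python
def histogram(values, low, high, bins):
    """Count how many values fall into each of `bins` equal-width intervals.

    Values outside [low, high] are ignored; `high` itself goes in the last bin.
    """
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        if low <= v <= high:
            index = min(int((v - low) / width), bins - 1)
            counts[index] += 1
    return counts

sample = [1, 2, 2, 3, 7, 8, 9, 10]  # made-up data
counts = histogram(sample, low=0, high=10, bins=5)

# Proportion of cases per bin, which is what the bar heights convey.
proportions = [c / len(sample) for c in counts]

print(counts)
print(proportions)
```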
Quantile Plot
- Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences)
- Plots quantile information
  - For data $x_i$ sorted in increasing order, $f_i$ indicates that approximately $100 f_i\%$ of the data are below or equal to the value $x_i$
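The pairs plotted in a quantile plot can be generated as follows. The slide does not say how $f_i$ is computed; this sketch assumes the common convention $f_i = (i - 0.5)/n$, and the function name is hypothetical.

```python
def quantile_positions(values):
    """Return (f_i, x_i) pairs for a quantile plot.

    Assumes f_i = (i - 0.5) / n for the i-th smallest value (i starts at 1),
    which is one common convention, not the only one.
    """
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]

# Made-up sample: each pair is one point on the quantile plot.
for f_i, x_i in quantile_positions([40, 10, 30, 20]):
    print(f"{100 * f_i:.1f}% of the data are <= {x_i}")
```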
Scatter plot
- Provides a first look at bivariate data to see clusters of points, outliers, etc.
- Each pair of values is treated as a pair of coordinates and plotted as points in the plane
Positively and Negatively Correlated Data
Uncorrelated Data
Data Visualization
- Why data visualization?
  - Gain insight into an information space by mapping data onto graphical primitives
  - Provide qualitative overview of large data sets
  - Help find interesting regions and suitable parameters for further quantitative analysis
  - Provide a visual proof of computer representations derived
[Figure panels: (a) Income, (b) Credit limit, (c) Transaction volume, (d) Age]
Laying Out Pixels in Circle Segments
- To save space and show the connections among multiple dimensions, space filling is often done in a circle segment

Matrix of scatterplots (x-y diagrams) of the k-dimensional data [total of (k^2/2 - k) scatterplots]
Landscapes
Used by permission of B. Wright, Visible Decisions Inc.
[Figure: news articles visualized as a landscape]
Icon-Based Visualization Techniques
Chernoff Faces
Stick Figure
A census data figure showing age, income, gender, education, etc.
A 5-piece stick figure (1 body and 4 limbs with different angle/length)
Dimensional Stacking
Visualization of oil mining data with longitude and latitude mapped to the outer x-, y-axes and ore grade and depth mapped to the inner x-, y-axes
Worlds-within-Worlds
- Assign the function and two most important parameters to the innermost world
- Fix all other parameters at constant values; draw other (1-, 2-, or 3-dimensional) worlds choosing these as the axes
- Software that uses this paradigm
  - N-vision: dynamic interaction through data glove and stereo displays, including rotation, scaling (inner), and translation (inner/outer)
  - Auto Visual: static interaction by means of queries
Tree-Map
- Screen-filling method which uses a hierarchical partitioning of the screen into regions depending on the attribute values
- The x- and y-dimensions of the screen are partitioned alternately according to the attribute values (classes)
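The alternating partitioning can be sketched as a small recursive layout routine. This is an illustration under assumptions the slide does not state: the hierarchy is a nested dict with hypothetical keys `name`, `size`, and `children`, each child's share of the rectangle is proportional to its total size, and leaf sizes are positive.

```python
def total_size(node):
    """Size of a leaf, or the summed size of all leaves below an inner node."""
    if "children" in node:
        return sum(total_size(child) for child in node["children"])
    return node["size"]

def layout(node, x, y, w, h, split_x=True, out=None):
    """Assign a rectangle (x, y, w, h) to every node, alternating the split axis."""
    if out is None:
        out = {}
    out[node["name"]] = (x, y, w, h)
    total = total_size(node)
    offset = 0.0  # fraction of the parent rectangle already used
    for child in node.get("children", []):
        share = total_size(child) / total
        if split_x:   # partition along the x-dimension at this level
            layout(child, x + offset * w, y, share * w, h, not split_x, out)
        else:         # partition along the y-dimension at the next level
            layout(child, x, y + offset * h, w, share * h, not split_x, out)
        offset += share
    return out

# Made-up hierarchy: A takes half the screen, B's two children split the rest.
tree = {"name": "root", "children": [
    {"name": "A", "size": 2},
    {"name": "B", "children": [{"name": "B1", "size": 1},
                               {"name": "B2", "size": 1}]},
]}
rects = layout(tree, 0, 0, 4, 2)
for name, rect in rects.items():
    print(name, rect)
```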
Three-D Cone Trees
- The 3D cone tree visualization technique works well for up to a thousand nodes or so
- First build a 2D circle tree that arranges its nodes in concentric circles centered on the root node
- Cannot avoid overlaps when projected to 2D
- G. Robertson, J. Mackinlay, S. Card. "Cone Trees: Animated 3D Visualizations of Hierarchical Information", ACM SIGCHI'91
- Graph from Nadeau Software Consulting website: visualize a social network data set that models the way an infection spreads from one person to the next
Visualizing Complex Data and Relations
- Visualizing non-numerical data: text and social networks
- Tag cloud: visualizing user-generated tags
  - The importance of tag is
Similarity and Dissimilarity
- Similarity
  - Numerical measure of how alike two data objects are
  - Value is higher when objects are more alike
  - Often falls in the range [0,1]
- Dissimilarity (e.g., distance)
  - Numerical measure of how different two data objects are
  - Lower when objects are more alike
  - Minimum dissimilarity is often 0
  - Upper limit varies
- Proximity refers to a similarity or dissimilarity
Data Matrix and Dissimilarity Matrix
- Data matrix
  - n data points with p dimensions
  - Two modes

$$\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots & & \vdots & & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots & & \vdots & & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}$$

- Dissimilarity matrix
  - n data points, but registers only the distance
  - A triangular matrix
  - Single mode

$$\begin{bmatrix}
0 & & & & \\
d(2,1) & 0 & & & \\
d(3,1) & d(3,2) & 0 & & \\
\vdots & \vdots & \vdots & & \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}$$
Proximity Measure for Nominal Attributes
$$d(i, j) = \frac{p - m}{p}$$
where m is the number of matches and p is the total number of attributes.
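The nominal dissimilarity above amounts to counting mismatches. A minimal sketch follows; the function name and the two example records are made up for illustration.

```python
def nominal_dissimilarity(obj_i, obj_j):
    """d(i, j) = (p - m) / p, with p attributes in total and m matching values."""
    p = len(obj_i)
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)
    return (p - m) / p

# Hypothetical objects described by (hair color, marital status, occupation).
person_a = ("red", "single", "engineer")
person_b = ("red", "married", "engineer")

print(nominal_dissimilarity(person_a, person_b))  # one mismatch out of three
```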
Proximity Measure for Binary Attributes
- A contingency table for binary data (rows: object i; columns: object j)
Dissimilarity between Binary Variables
- Example

Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack  M       Y      N      P       N       N       N
Mary  F       Y      N      P       N       P       N
Jim   M       Y      P      N       N       N       N
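One way to work through the table is sketched below. The assumptions are not stated in the text above and are labeled here: Gender is treated as symmetric and left out, the remaining attributes are treated as asymmetric with Y and P coded as 1 and N as 0, and dissimilarity is taken as the share of mismatches among attributes where at least one object has a 1.

```python
# Rows copied from the example table, Gender omitted (assumed symmetric).
patients = {
    "Jack": ["Y", "N", "P", "N", "N", "N"],
    "Mary": ["Y", "N", "P", "N", "P", "N"],
    "Jim":  ["Y", "P", "N", "N", "N", "N"],
}

def encode(values):
    """Assumed coding: Y and P become 1, N becomes 0."""
    return [1 if v in ("Y", "P") else 0 for v in values]

def asymmetric_binary_dissimilarity(x, y):
    """Mismatches divided by positions where at least one object is 1.

    Positions where both objects are 0 are ignored (assumed asymmetric case).
    """
    both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    only_x = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    only_y = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    compared = both + only_x + only_y
    if compared == 0:
        return 0.0  # nothing to compare; treated as identical here
    return (only_x + only_y) / compared

encoded = {name: encode(values) for name, values in patients.items()}
for first, second in [("Jack", "Mary"), ("Jack", "Jim"), ("Jim", "Mary")]:
    d = asymmetric_binary_dissimilarity(encoded[first], encoded[second])
    print(first, second, round(d, 2))
```

Under these assumptions Jack and Mary come out as the most similar pair and Jim and Mary as the least similar.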

Standardizing Numeric Data
- Z-score: $z = \frac{x - \mu}{\sigma}$
  - x: raw score to be standardized, μ: mean of the population, σ: standard deviation
  - The distance between the raw score and the population mean in units of the standard deviation
  - Negative when the raw score is below the mean, positive when above
- An alternative way: calculate the mean absolute deviation
  $$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$$
  where
  $$m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$$
  - Standardized measure (z-score): $z_{if} = \frac{x_{if} - m_f}{s_f}$
- Using mean absolute deviation is more robust than using standard deviation
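Both standardizations can be compared on a small column of values. This is a sketch with made-up data and hypothetical function names; it assumes the column is not constant, since otherwise the divisor is zero.

```python
from statistics import pstdev

def standardize_std(column):
    """z = (x - mu) / sigma, using the population standard deviation."""
    mu = sum(column) / len(column)
    sigma = pstdev(column)
    return [(x - mu) / sigma for x in column]

def standardize_mad(column):
    """z_if = (x_if - m_f) / s_f, with s_f the mean absolute deviation."""
    n = len(column)
    m_f = sum(column) / n
    s_f = sum(abs(x - m_f) for x in column) / n
    return [(x - m_f) / s_f for x in column]

column = [2, 4, 4, 6]  # illustrative values for one attribute f

print(standardize_std(column))
print(standardize_mad(column))
```

Because the mean absolute deviation does not square the deviations, a single extreme value inflates the divisor less than it inflates the standard deviation, so its z-score stays more visible as an outlier.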
Example: Data Matrix and Dissimilarity Matrix

Data Matrix

point  attribute1  attribute2
x1     1           2
x2     3           5
x3     2           0
x4     4           5

[Figure: scatter plot of the points x1 to x4; both axes run from 0 to 4 and beyond]

Dissimilarity Matrix (with Euclidean Distance)

      x1    x2    x3    x4
x1    0
x2    3.61  0
x3    2.24  5.1   0
x4    4.24  1     5.39  0
Distance on Numeric Data: Minkowski Distance
- Minkowski distance: a popular distance measure
  $$d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h}$$
  where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm)
- Properties
  - d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (positive definiteness)
  - d(i, j) = d(j, i) (symmetry)
  - d(i, j) ≤ d(i, k) + d(k, j) (triangle inequality)
- A distance that satisfies these properties is a metric
Special Cases of Minkowski Distance
- h = 1: Manhattan (city block, L1 norm) distance
  - E.g., the Hamming distance: the number of bit positions in which two binary vectors differ
Example: Minkowski Distance
point  attribute 1  attribute 2
x1     1            2
x2     3            5
x3     2            0
x4     4            5

[Figure: scatter plot of the points x1 to x4]

Dissimilarity Matrices

Manhattan (L1)

L1    x1    x2    x3    x4
x1    0
x2    5     0
x3    3     6     0
x4    6     1     7     0

Euclidean (L2)

L2    x1    x2    x3    x4
x1    0
x2    3.61  0
x3    2.24  5.1   0
x4    4.24  1     5.39  0

Supremum (L∞)

L∞    x1    x2    x3    x4
x1    0
x2    3     0
x3    2     5     0
x4    3     1     5     0
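The three matrices can be reproduced from the four points with one function. The sketch below is illustrative: the names `minkowski` and `dissimilarity_matrix` are made up, and the supremum case is handled by passing `math.inf` as the order.

```python
import math

points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}

def minkowski(a, b, h):
    """Minkowski distance of order h; h = math.inf gives the supremum distance."""
    if math.isinf(h):
        return max(abs(x - y) for x, y in zip(a, b))
    return sum(abs(x - y) ** h for x, y in zip(a, b)) ** (1 / h)

def dissimilarity_matrix(points, h):
    """Lower-triangular matrix as a dict keyed by (row, column), rounded to 2 places."""
    names = list(points)
    return {
        (row, col): round(minkowski(points[row], points[col], h), 2)
        for index, row in enumerate(names)
        for col in names[: index + 1]
    }

for label, h in [("Manhattan (L1)", 1), ("Euclidean (L2)", 2), ("Supremum", math.inf)]:
    matrix = dissimilarity_matrix(points, h)
    print(label)
    for index, row in enumerate(points):
        print(row, [matrix[(row, col)] for col in list(points)[: index + 1]])
```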
Ordinal Variables
Attributes of Mixed Type
- A database may contain all attribute types
  - Nominal, symmetric binary, asymmetric binary, numeric, ordinal
- One may use a weighted formula to combine their effects
  $$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$
  - f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
  - f is numeric: use the normalized distance
  - f is ordinal
    - Compute ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
    - Treat $z_{if}$ as interval-scaled
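The weighted formula can be sketched as follows. Several details are assumptions rather than something the slide states: the indicator δ is taken as 0 when either value is missing (`None`) and 1 otherwise, the normalized numeric distance is taken as |x_if - x_jf| divided by the attribute's range, and the schema format and sample objects are made up.

```python
def mixed_dissimilarity(obj_i, obj_j, schema):
    """Combine per-attribute dissimilarities for objects of mixed type.

    `schema` holds one (kind, info) pair per attribute:
      ("nominal", None)            binary attributes are handled the same way
      ("numeric", (min_f, max_f))  range used for normalization (assumed)
      ("ordinal", [levels...])     levels listed from lowest to highest rank
    """
    numerator = 0.0
    denominator = 0
    for x, y, (kind, info) in zip(obj_i, obj_j, schema):
        if x is None or y is None:
            continue  # delta = 0: attribute contributes nothing (assumption)
        if kind == "nominal":
            d = 0.0 if x == y else 1.0
        elif kind == "numeric":
            min_f, max_f = info
            d = abs(x - y) / (max_f - min_f)
        elif kind == "ordinal":
            m_f = len(info)
            # list index equals rank - 1, so z = (r - 1) / (M_f - 1)
            z_x = info.index(x) / (m_f - 1)
            z_y = info.index(y) / (m_f - 1)
            d = abs(z_x - z_y)
        else:
            raise ValueError(f"unknown attribute kind: {kind}")
        numerator += d
        denominator += 1
    if denominator == 0:
        raise ValueError("no comparable attributes")
    return numerator / denominator

# Hypothetical objects: (nominal code, ordinal grade, numeric score).
schema = [("nominal", None),
          ("ordinal", ["fair", "good", "excellent"]),
          ("numeric", (22, 64))]
o1 = ("A", "excellent", 45)
o2 = ("B", "fair", 22)
o4 = ("A", "excellent", 28)

print(round(mixed_dissimilarity(o1, o2, schema), 2))
print(round(mixed_dissimilarity(o1, o4, schema), 2))
```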
Cosine Similarity
- A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as a keyword) or phrase in the document.
Example: Cosine Similarity
- cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||), where • indicates the vector dot product and ||d|| is the length of vector d

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
d1•d2 = 5*3+0*0+3*2+0*0+2*1+0*1+0*0+2*1+0*0+0*1 = 25
||d1|| = (5*5+0*0+3*3+0*0+2*2+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (3*3+0*0+2*2+0*0+1*1+1*1+0*0+1*1+0*0+1*1)^0.5 = (17)^0.5 = 4.12
cos(d1, d2) = 0.94
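The worked example can be checked with a few lines of Python. This is a plain sketch using only the standard library; the function name is illustrative, and it assumes neither vector is all zeros.

```python
import math

def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (||a|| * ||b||) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Term-frequency vectors from the example.
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

print(round(cosine_similarity(d1, d2), 2))
```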
Summary
- Data attribute types: nominal, binary, ordinal, interval-scaled, ratio-scaled
- Many types of data sets, e.g., numerical, text, graph, Web, image.
- Gain insight into the data by:
  - Basic statistical data description: central tendency,