02Data
— Chapter 2 —
n Data Visualization
n Summary
Types of Data Sets
n Record
n Relational records
n Data matrix, e.g., numerical matrix, crosstabs
n Document data: text documents: term-frequency vector
Document 1: 3 0 5 0 2 6 0 2 0 2
Document 2: 0 7 0 2 1 0 0 3 0 0
n Transaction data
n Graph and network
n Dimensionality
n Curse of dimensionality
n Sparsity
n Only presence counts
n Resolution
n Patterns depend on the scale
n Distribution
n Centrality and dispersion
Data Objects
n Binary
n Numeric: quantitative
n Interval-scaled
n Ratio-scaled
Attribute Types
n Nominal: categories, states, or “names of things”
n Hair_color = {auburn, black, blond, brown, grey, red, white}
n marital status, occupation, ID numbers, zip codes
n Binary
n Nominal attribute with only 2 states (0 and 1)
n Symmetric binary: both outcomes equally important
n e.g., gender
n Asymmetric binary: outcomes not equally important.
n e.g., medical test (positive vs. negative)
n Convention: assign 1 to most important outcome (e.g., HIV
positive)
n Ordinal
n Values have a meaningful order (ranking) but magnitude between
successive values is not known.
n Size = {small, medium, large}, grades, army rankings
Numeric Attribute Types
n Quantity (integer or real-valued)
n Interval
n Measured on a scale of equal-sized units
n Values have order
n E.g., temperature in °C or °F, calendar dates
n No true zero-point
n Ratio
n Inherent zero-point
n We can speak of a value as being a multiple of
another value or of the unit of measurement
(e.g., 10 K is twice as high as 5 K).
n e.g., temperature in Kelvin, length, counts,
monetary quantities
Discrete vs. Continuous Attributes
n Discrete Attribute
n Has only a finite or countably infinite set of values
n E.g., the set of words in a collection of documents
n Sometimes represented as integer variables
Basic Statistical Descriptions of Data
n Motivation
n To better understand the data: central tendency,
variation and spread
n Data dispersion characteristics
n median, max, min, quantiles, outliers, variance, etc.
n Numerical dimensions correspond to sorted intervals
n Data dispersion: analyzed with multiple granularities
of precision
n Boxplot or quantile analysis on sorted intervals
n Dispersion analysis on computed measures
n Folding measures into numerical dimensions
n Boxplot or quantile analysis on the transformed cube
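The dispersion characteristics listed above (median, max, min, quantiles, outliers) can be sketched as a five-number summary. This is a minimal illustration, not the deck's own method: the `salaries` values are hypothetical, the quartiles use the "inclusive" interpolation of Python's `statistics.quantiles`, and the 1.5 × IQR outlier rule is a common convention that the slide does not state.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    data = sorted(values)
    # With n=4, statistics.quantiles returns the three quartile cut points.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]

def iqr_outliers(values, k=1.5):
    """Values lying more than k * IQR below Q1 or above Q3 (assumed rule)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical sample (e.g., salaries in thousands).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
summary = five_number_summary(salaries)
outliers = iqr_outliers(salaries)
print(summary)   # (30, 49.25, 54.0, 64.75, 110)
print(outliers)  # [110]
```

A boxplot draws exactly these five numbers, with the flagged values plotted individually.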
Measuring the Central Tendency
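n Mean (algebraic measure) (sample vs. population):
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ (sample), $\mu = \frac{\sum x}{N}$ (population)
Note: n is sample size and N is population size.
n Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
n Trimmed mean: chopping extreme values
n Median: the middle value of the ordered data (the average of the two
middle values when the number of values is even)
n Variance (population): $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$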
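The central-tendency measures named on this slide can be tried on a small sample. The sketch below uses hypothetical data; `weighted_mean` and `trimmed_mean` are illustrative helper names, and the 10% trimming fraction is an assumption chosen only for the example.

```python
import statistics

def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

def trimmed_mean(values, fraction):
    """Mean after chopping `fraction` of the values off each end."""
    data = sorted(values)
    k = int(len(data) * fraction)      # number of values dropped per end
    kept = data[k:len(data) - k]
    return sum(kept) / len(kept)

# Hypothetical sample with one extreme value (110).
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

mean = statistics.mean(values)          # 58
median = statistics.median(values)      # 54.0
tmean = trimmed_mean(values, 0.10)      # drops 30 and 110, giving 55.6
wmean = weighted_mean([1.0, 2.0, 4.0], [3, 2, 1])   # 11/6

print(mean, median, tmean, wmean)
```

The trimmed mean and the median sit below the plain mean here because the single extreme value pulls the mean upward.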
Boxplot Analysis
Visualization of Data Dispersion: 3-D Boxplots
Graphic Displays of Basic Statistical Descriptions
Histogram Analysis
n Histogram: Graph display of tabulated frequencies, shown as bars
n It shows what proportion of cases fall into each of several categories
n Differs from a bar chart in that it is the area of the bar that denotes the
value, not the height as in bar charts, a crucial distinction when the
categories are not of uniform width
n The categories are usually specified as non-overlapping intervals of
some variable. The categories (bars) must be adjacent
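The point about area versus height can be made concrete with bins of unequal width. In this minimal sketch the data, the bin edges, and the helper name `histogram_bars` are all hypothetical. Each bar's height is the count divided by the bin width, so the bar's area recovers the count.

```python
def histogram_bars(values, edges):
    """Return (count, height) per bin, with height = count / bin width.

    Bins are adjacent and non-overlapping: half-open [lo, hi), except the
    last bin, which also includes its upper edge.
    """
    bars = []
    last = len(edges) - 2
    for b, (lo, hi) in enumerate(zip(edges, edges[1:])):
        count = sum(
            1 for v in values
            if lo <= v < hi or (b == last and v == hi)
        )
        bars.append((count, count / (hi - lo)))
    return bars

prices = [5, 12, 18, 22, 25, 31, 38, 44, 60, 95]   # hypothetical data
edges = [0, 20, 40, 100]                            # non-uniform widths
bars = histogram_bars(prices, edges)

for (lo, hi), (count, height) in zip(zip(edges, edges[1:]), bars):
    print(f"[{lo}, {hi}): count={count}, height={height:.3f}, "
          f"area={height * (hi - lo):.1f}")
```

The first and last bins hold the same number of values, yet the last bar is only a third as tall because it is three times as wide.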
Histograms Often Tell More than Boxplots
Quantile Plot
n Displays all of the data (allowing the user to assess both
the overall behavior and unusual occurrences)
n Plots quantile information
n For data xi sorted in increasing order, fi indicates that approximately
100 fi% of the data are below or equal to the value xi
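A quantile plot pairs each sorted value with the fraction of data at or below it. The sketch below assumes the common convention f_i = (i − 0.5) / n, which the slide text does not spell out; the function name and the data are hypothetical.

```python
def quantile_plot_points(values):
    """Return (f_i, x_i) pairs for sorted data, with f_i = (i - 0.5) / n."""
    data = sorted(values)
    n = len(data)
    return [((i - 0.5) / n, x) for i, x in enumerate(data, start=1)]

points = quantile_plot_points([70, 40, 110, 50])
for f, x in points:
    print(f"f = {f:.3f}  x = {x}")
```

Plotting x against f shows all of the data at once, so both the overall shape and any unusual values are visible.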
Scatter plot
n Provides a first look at bivariate data to see clusters of
points, outliers, etc.
n Each pair of values is treated as a pair of coordinates and
plotted as points in the plane
Positively and Negatively Correlated Data
Uncorrelated Data
Chapter 2: Getting to Know Your Data
Data Visualization
n Why data visualization?
n Gain insight into an information space by mapping data onto graphical
primitives
n Provide qualitative overview of large data sets
n Search for patterns, trends, structure, irregularities, relationships among
data
n Help find interesting regions and suitable parameters for further
quantitative analysis
n Provide a visual proof of computer representations derived
n Categorization of visualization methods:
n Pixel-oriented visualization techniques
n Geometric projection visualization techniques
n Icon-based visualization techniques
n Hierarchical visualization techniques
n Visualizing complex data and relations
Pixel-Oriented Visualization Techniques
n For a data set of m dimensions, create m windows on the screen, one
for each dimension
n The m dimension values of a record are mapped to m pixels at the
corresponding positions in the windows
n The colors of the pixels reflect the corresponding values
(a) Income (b) Credit Limit (c) transaction volume (d) age
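The mapping described above (one window per dimension, one pixel per record, color reflecting the value) can be sketched without any graphics library. The records, the gray-level scale of 0 to 255, and the row-by-row pixel order are assumptions for illustration; real pixel-oriented tools use more elaborate arrangements.

```python
def pixel_windows(records, width):
    """Build one grid ("window") of 0-255 gray levels per dimension.

    Record number r occupies the same (row, column) position in every
    window, so the windows can be compared pixel by pixel.
    """
    m = len(records[0])                 # number of dimensions
    windows = []
    for d in range(m):
        column = [rec[d] for rec in records]
        lo, hi = min(column), max(column)
        span = (hi - lo) or 1           # avoid dividing by zero
        levels = [round(255 * (v - lo) / span) for v in column]
        rows = [levels[i:i + width] for i in range(0, len(levels), width)]
        windows.append(rows)
    return windows

# Hypothetical (age, income) records.
customers = [(20, 1000), (35, 4000), (50, 2500), (80, 7000)]
windows = pixel_windows(customers, width=2)
for name, grid in zip(("age", "income"), windows):
    print(name, grid)
```

Sorting the records by one dimension before building the windows makes correlations with the other dimensions show up as matching gradients.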
Laying Out Pixels in Circle Segments
n To save space and show the connections among multiple dimensions,
space filling is often done in a circle segment
Landscapes
Used by permission of B. Wright, Visible Decisions Inc.
News articles visualized as a landscape
Icon-Based Visualization Techniques
Chernoff Faces
Stick Figure
A census data figure showing age, income, gender, education, etc.
Used by permission of G. Grinstein, University of Massachusetts at Lowell
A 5-piece stick figure (1 body and 4 limbs with different angle/length)
Two attributes mapped to axes, remaining attributes mapped to angle or length of limbs. Look at texture pattern.
Hierarchical Visualization Techniques
Dimensional Stacking
[Figure: dimensional stacking grid with axes labeled attribute 1, attribute 2, attribute 3, attribute 4]
Visualization of oil mining data with longitude and latitude mapped to the
outer x-, y-axes and ore grade and depth mapped to the inner x-, y-axes
Worlds-within-Worlds
n Assign the function and two most important parameters to innermost
world
n Fix all other parameters at constant values; draw other (1-, 2-, or
3-dimensional) worlds choosing these as the axes
n Software that uses this paradigm
n N–vision: Dynamic interaction through data glove and stereo
displays, including rotation, scaling (inner) and translation
(inner/outer)
n Auto Visual: Static interaction by means of queries
Tree-Map
n Screen-filling method which uses a hierarchical partitioning
of the screen into regions depending on the attribute values
n The x- and y-dimension of the screen are partitioned
alternately according to the attribute values (classes)
Ack.: http://www.cs.umd.edu/hcil/treemap-history/all102001.jpg
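The alternating x/y partitioning can be sketched with the simple "slice-and-dice" layout. This is one possible reading of the description above, not necessarily the layout used for the figure. The tree, the node sizes, and the function name are hypothetical, and each internal node's size is assumed to equal the sum of its children.

```python
def slice_and_dice(node, x, y, w, h, split_x=True, out=None):
    """Lay out a tree of (name, size, children) tuples as rectangles.

    Children share the parent's rectangle in proportion to their sizes;
    the split direction alternates between x and y at each level.
    Returns a list of (name, x, y, width, height).
    """
    if out is None:
        out = []
    name, _size, children = node
    out.append((name, x, y, w, h))
    total = sum(child[1] for child in children)
    offset = 0.0
    for child in children:
        share = child[1] / total
        if split_x:   # partition the x-dimension
            slice_and_dice(child, x + offset * w, y, share * w, h, False, out)
        else:         # partition the y-dimension
            slice_and_dice(child, x, y + offset * h, w, share * h, True, out)
        offset += share
    return out

# Hypothetical file system: sizes are in arbitrary units.
tree = ("root", 100, [
    ("docs", 25, []),
    ("src", 75, [("a.py", 25, []), ("b.py", 50, [])]),
])
rects = slice_and_dice(tree, 0, 0, 100, 100)
for rect in rects:
    print(rect)
```

Each leaf's area is proportional to its size, which is what lets a tree-map show large files or directories at a glance.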
Tree-Map of a File System (Shneiderman)
InfoCube
Three-D Cone Trees
n 3D cone tree visualization technique works
well for up to a thousand nodes or so
n First build a 2D circle tree that arranges its
nodes in concentric circles centered on the
root node
n Cannot avoid overlaps when projected to
2D
n G. Robertson, J. Mackinlay, S. Card. “Cone
Trees: Animated 3D Visualizations of
Hierarchical Information”, ACM SIGCHI'91
n Graph from Nadeau Software Consulting
website: Visualize a social network data set
that models the way an infection spreads
from one person to the next
Ack.: http://nadeausoftware.com/articles/visualization
Visualizing Complex Data and Relations
n Visualizing non-numerical data: text and social networks
n Tag cloud: visualizing user-generated tags
n The importance of a tag is represented by font size/color
n Besides text data, there are also methods to visualize
relationships, such as visualizing social networks
Similarity and Dissimilarity
n Similarity
n Numerical measure of how alike two data objects are
n Dissimilarity (e.g., distance)
n Numerical measure of how different two data objects are
n Lower when objects are more alike
Data Matrix and Dissimilarity Matrix
n Data matrix
n n data points with p dimensions
n Two modes
$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$
n Dissimilarity matrix
n n data points, but registers only the distance
n A triangular matrix
n Single mode
$$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$
Proximity Measure for Nominal Attributes
Proximity Measure for Binary Attributes
n A contingency table for binary data (rows: object i; columns: object j)
Dissimilarity between Binary Variables
n Example
Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack  M       Y      N      P       N       N       N
Mary  F       Y      N      P       N       P       N
Jim   M       Y      P      N       N       N       N
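A dissimilarity for the three patients in this example can be sketched as follows. The slide's formula did not survive extraction, so the code uses the standard asymmetric binary dissimilarity d(i, j) = (r + s) / (q + r + s), where q counts attributes that are 1 for both objects, r those that are 1 only for i, and s those that are 1 only for j. It further assumes that gender is symmetric and left out, and that Y and P are coded as 1 and N as 0.

```python
def asymmetric_binary_dissimilarity(a, b):
    """d = (r + s) / (q + r + s); joint absences (0, 0) are ignored."""
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    return (r + s) / (q + r + s)

# Attribute order: Fever, Cough, Test-1, Test-2, Test-3, Test-4
# (Y and P coded as 1, N coded as 0; this coding is an assumption).
patients = {
    "Jack": (1, 0, 1, 0, 0, 0),
    "Mary": (1, 0, 1, 0, 1, 0),
    "Jim":  (1, 1, 0, 0, 0, 0),
}

d_jack_mary = asymmetric_binary_dissimilarity(patients["Jack"], patients["Mary"])
d_jack_jim = asymmetric_binary_dissimilarity(patients["Jack"], patients["Jim"])
d_jim_mary = asymmetric_binary_dissimilarity(patients["Jim"], patients["Mary"])
print(round(d_jack_mary, 2), round(d_jack_jim, 2), round(d_jim_mary, 2))
# 0.33 0.67 0.75
```

Under these assumptions Jack and Mary are the most alike, while Jim and Mary are the least alike. The function divides by zero if two objects have no positive attribute at all, so real code would need to handle that case.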
n Standardized measure (z-score): $z = \frac{x - \mu}{\sigma}$,
where $\mu$ is the mean and $\sigma$ the standard deviation of the attribute
n Using mean absolute deviation is more robust than using standard
deviation
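Both standardizations can be compared on a small sample. This sketch assumes the usual definitions: the z-score divides by the population standard deviation, and the alternative divides by the mean absolute deviation s = (1/n) Σ |x_i − µ|. The data and function names are hypothetical.

```python
import statistics

def z_scores(values):
    """Standardize with the mean and the population standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(x - mu) / sigma for x in values]

def mad_scores(values):
    """Standardize with the mean absolute deviation instead of sigma."""
    mu = statistics.fmean(values)
    s = statistics.fmean([abs(x - mu) for x in values])
    return [(x - mu) / s for x in values]

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical values; mean 5, sigma 2
z = z_scores(data)
m = mad_scores(data)
print(z[-1], round(m[-1], 3))      # 2.0 2.667
```

Because deviations are not squared, the mean absolute deviation is inflated less by extreme values, so the largest value keeps a larger standardized score (2.667 against 2.0 here) and stays easier to spot.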
Example:
Data Matrix and Dissimilarity Matrix
Data Matrix
point  attribute1  attribute2
x1     1           2
x2     3           5
x3     2           0
x4     4           5
Dissimilarity Matrix
(with Euclidean Distance)
      x1    x2    x3    x4
x1    0
x2    3.61  0
x3    2.24  5.1   0
x4    4.24  1     5.39  0
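Going from a data matrix (two modes: objects and attributes) to a dissimilarity matrix (one mode: objects only) can be sketched in a few lines. The four points are those of this example; the function name is hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math

# The four 2-D points of the example.
points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}

def dissimilarity_matrix(data):
    """Lower-triangular matrix of Euclidean distances, rounded to 2 places."""
    names = list(data)
    rows = []
    for i, a in enumerate(names):
        rows.append([round(math.dist(data[a], data[b]), 2)
                     for b in names[:i + 1]])
    return rows

matrix = dissimilarity_matrix(points)
for name, row in zip(points, matrix):
    print(name, row)
# x1 [0.0]
# x2 [3.61, 0.0]
# x3 [2.24, 5.1, 0.0]
# x4 [4.24, 1.0, 5.39, 0.0]
```

Only the lower triangle is stored because the distance is symmetric and every object is at distance 0 from itself.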
Distance on Numeric Data: Minkowski Distance
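n Minkowski distance: A popular distance measure
$d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h}$
where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two
p-dimensional data objects, and h is the order (the
distance so defined is also called the L-h norm)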
n Properties
n d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
n d(i, j) = d(j, i) (Symmetry)
n d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
n A distance that satisfies these properties is a metric
Special Cases of Minkowski Distance
n h = 1: Manhattan (city block, L1 norm) distance
n E.g., the Hamming distance: the number of bits that are
different between two binary vectors
Example: Minkowski Distance
Dissimilarity Matrices
point  attribute 1  attribute 2
x1     1            2
x2     3            5
x3     2            0
x4     4            5
Manhattan (L1)
L1    x1    x2    x3    x4
x1    0
x2    5     0
x3    3     6     0
x4    6     1     7     0
Euclidean (L2)
L2    x1    x2    x3    x4
x1    0
x2    3.61  0
x3    2.24  5.1   0
x4    4.24  1     5.39  0
Supremum
L∞    x1    x2    x3    x4
x1    0
x2    3     0
x3    2     5     0
x4    3     1     5     0
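The three dissimilarity matrices can be reproduced with one Minkowski function. This is a minimal sketch: the supremum distance is handled as the limit h → ∞ (the largest per-attribute difference), and the function names are hypothetical.

```python
import math

def minkowski(a, b, h):
    """Minkowski distance of order h; h = math.inf gives the supremum."""
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if math.isinf(h):
        return max(diffs)
    return sum(d ** h for d in diffs) ** (1 / h)

points = [(1, 2), (3, 5), (2, 0), (4, 5)]   # x1, x2, x3, x4

def lower_triangle(h):
    """Lower-triangular dissimilarity matrix, rounded to 2 places."""
    return [[round(minkowski(points[i], points[j], h), 2)
             for j in range(i + 1)]
            for i in range(len(points))]

manhattan = lower_triangle(1)
euclidean = lower_triangle(2)
supremum = lower_triangle(math.inf)

for label, matrix in (("L1", manhattan), ("L2", euclidean), ("Linf", supremum)):
    print(label, matrix)
```

For any pair of points the supremum distance is never larger than the Euclidean distance, which in turn is never larger than the Manhattan distance, as the printed matrices show.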
Ordinal Variables
Attributes of Mixed Type
Cosine Similarity
n A document can be represented by thousands of attributes, each
recording the frequency of a particular word (such as keywords) or
phrase in the document.
Example: Cosine Similarity
n cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||),
where • indicates the vector dot product and ||d|| is the length of vector d
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
d1 • d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5
= 6.481
||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = (17)^0.5
= 4.12
cos(d1, d2) = 25 / (6.481 * 4.12) = 0.94
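The worked example can be checked with a few lines of code. The function name is hypothetical; the vectors are the term-frequency vectors d1 and d2 from the example.

```python
import math

def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (||a|| * ||b||) for equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

similarity = cosine_similarity(d1, d2)
print(round(similarity, 2))   # 0.94
```

Terms that are zero in either document contribute nothing to the dot product, which is why the measure suits long, sparse term-frequency vectors.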
Summary
n Data attribute types: nominal, binary, ordinal, interval-scaled, ratio-
scaled
n Many types of data sets, e.g., numerical, text, graph, Web, image.
n Gain insight into the data by:
n Basic statistical data description: central tendency, dispersion,
graphical displays
n Data visualization: map data onto graphical primitives
n Measure data similarity
n Above steps are the beginning of data preprocessing.
n Many methods have been developed, but this is still an active area of research.