02data (Compatibility Mode)
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign & Simon Fraser University
©2011 Han, Kamber, and Pei. All rights reserved.

Data Visualization
Measuring Data Similarity and Dissimilarity
Summary
crosstabs
Document data: text documents: term-frequency vector
Transaction data (TID, Items)
Graph and network
  World Wide Web
  Social or information networks
  Molecular Structures
Ordered

Curse of dimensionality
Sparsity
  Only presence counts
Resolution
Attribute Types
Nominal: categories, states, or “names of things”
  Hair_color = {auburn, black, blond, brown, grey, red, white}
  marital status, occupation, ID numbers, zip codes
Binary
  Nominal attribute with only 2 states (0 and 1)
  Symmetric binary: both outcomes equally important
    e.g., gender
  Asymmetric binary: outcomes not equally important
    e.g., medical test (positive vs. negative)
    Convention: assign 1 to most important outcome (e.g., HIV positive)
Ordinal
  Values have a meaningful order (ranking) but magnitude between successive values is not known
  Size = {small, medium, large}, grades, army rankings

Numeric Attribute Types
Quantity (integer or real-valued)
Interval
  Measured on a scale of equal-sized units
  Values have order
    E.g., temperature in °C or °F, calendar dates
  No true zero-point
Ratio
  Inherent zero-point
  We can speak of values as being an order of magnitude larger than the unit of measurement (10 K is twice as high as 5 K)
    e.g., temperature in Kelvin, length, counts, monetary quantities
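The difference between interval and ratio attributes can be checked numerically. The following is a minimal sketch, not taken from the slides: the helper names `celsius_to_fahrenheit` and `kelvin_to_rankine` are illustrative, and the Rankine scale is brought in only as a second scale that shares Kelvin's true zero. A ratio of two interval-scaled readings changes when the unit changes, while a ratio of two ratio-scaled readings does not.

```python
def celsius_to_fahrenheit(c: float) -> float:
    """Convert between two interval scales (neither has a true zero)."""
    return c * 9.0 / 5.0 + 32.0


def kelvin_to_rankine(k: float) -> float:
    """Convert between two ratio scales (both start at absolute zero)."""
    return k * 9.0 / 5.0


# Interval scale: the "ratio" of two readings depends on the unit chosen.
ratio_celsius = 10.0 / 5.0
ratio_fahrenheit = celsius_to_fahrenheit(10.0) / celsius_to_fahrenheit(5.0)

# Ratio scale: the ratio survives a change of unit.
ratio_kelvin = 10.0 / 5.0
ratio_rankine = kelvin_to_rankine(10.0) / kelvin_to_rankine(5.0)

print(ratio_celsius, ratio_fahrenheit)  # the two values differ
print(ratio_kelvin, ratio_rankine)      # the two values agree
```

Differences remain meaningful on both kinds of scale; only statements such as "twice as high" require a ratio attribute.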
collection of documents
Sometimes, represented as integer variables
Note: Binary attributes are a special case of discrete attributes
Continuous Attribute
  Has real numbers as attribute values

Basic Statistical Descriptions of Data
Data Visualization
Symmetric vs. Skewed Data
  Median, mean and mode of symmetric

Measuring the Dispersion of Data
  Quartiles, outliers and boxplots
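The measures named on these two slides can be computed with Python's standard `statistics` module. This is a sketch under stated assumptions: the sample values are hypothetical, the quartiles use the module's `"inclusive"` interpolation method (other conventions give slightly different quartiles), and the 1.5 × IQR outlier fence is a common boxplot convention that the slide text above does not spell out.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of at least two numbers."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical sample, chosen to be skewed by one large value.
sample = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

mean_value = statistics.mean(sample)
median_value = statistics.median(sample)
modes = statistics.multimode(sample)  # every most-frequent value
outliers = iqr_outliers(sample)

print("mean:", mean_value, "median:", median_value, "modes:", modes)
print("five-number summary:", five_number_summary(sample))
print("outliers:", outliers)
```

In this sample the mean exceeds the median, which is the signature of positively skewed data; in perfectly symmetric data the two coincide.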
Histogram Analysis
Histograms Often Tell More than Boxplots
Uncorrelated Data

Chapter 2: Getting to Know Your Data
  Data Visualization
  Summary
To save space and show the connections among multiple dimensions, space filling is often done in a circle segment
  (a) Representing a data record in circle segment
  (b) Laying out pixels in circle segment

Visualization of geometric transformations and projections of the data
Methods
  Direct visualization
  Scatterplot and scatterplot matrices
  Landscapes
  Projection pursuit technique: Help users find meaningful projections of multidimensional data
  Prosection views
  Hyperslice
  Parallel coordinates
Direct Data Visualization
  Ribbons with Twists Based on Vorticity

Scatterplot Matrices

news articles visualized as a landscape

corresponding attribute
Every data item corresponds to a polygonal line which intersects each of the axes at the point which corresponds to the value for the attribute
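The polygonal-line idea behind parallel coordinates can be sketched without any plotting library. The function below is an illustration only: the name `parallel_coordinates_polylines` is hypothetical, and scaling each axis to its attribute's [minimum, maximum] range is an assumption made so that all axes share one height. It returns, for every record, the vertices where its polyline crosses each axis.

```python
def parallel_coordinates_polylines(records):
    """Map each record to a list of (axis_index, scaled_value) vertices.

    Each attribute gets its own vertical axis; values are rescaled to [0, 1]
    using that attribute's minimum and maximum (an assumed convention).
    """
    n_axes = len(records[0])
    lows = [min(r[k] for r in records) for k in range(n_axes)]
    highs = [max(r[k] for r in records) for k in range(n_axes)]

    polylines = []
    for record in records:
        vertices = []
        for k in range(n_axes):
            span = highs[k] - lows[k]
            # A constant attribute has no spread; place it mid-axis.
            y = (record[k] - lows[k]) / span if span else 0.5
            vertices.append((k, y))
        polylines.append(vertices)
    return polylines


# Three hypothetical records with three attributes each.
records = [(1.0, 20.0, 300.0), (2.0, 10.0, 100.0), (3.0, 30.0, 200.0)]
lines = parallel_coordinates_polylines(records)
for line in lines:
    print(line)
```

Drawing each vertex list as a connected line over evenly spaced vertical axes yields the parallel-coordinates picture.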
Chernoff Faces
A way to display variables on a two-dimensional surface, e.g., let x be eyebrow slant, y be eye size, z be nose length, etc.
The figure shows faces produced using 10 characteristics (head eccentricity, eye size, eye spacing, eye eccentricity, pupil size, eyebrow slant, nose size, mouth shape, mouth size, and mouth opening): Each assigned one of 10 possible values, generated using Mathematica (S. Dickson)

Stick Figure
A census data figure showing age, income, gender, education, etc.
Visualization of oil mining data with longitude and latitude mapped to the outer x-, y-axes and ore grade and depth mapped to the inner x-, y-axes
Methods
  N–vision: Dynamic interaction through data glove and stereo displays, including rotation, scaling (inner) and translation (inner/outer)
  Auto Visual: Static interaction by means of queries
Tree-Map
Tree-Map of a File System (Shneiderman)
Ack.: http://www.cs.umd.edu/hcil/treemap-history/all102001.jpg

A 3-D visualization technique where hierarchical information is displayed as nested semi-transparent cubes

3D cone tree visualization technique works well for up to a thousand nodes or so
First build a 2D circle tree that arranges its
Visualizing Complex Data and Relations
Visualizing non-numerical data: text and social networks
Tag cloud: visualizing user-generated tags
  The importance of a tag is represented by font size/color
Besides text data, there are also methods to visualize relationships, such as visualizing social networks

Chapter 2: Getting to Know Your Data
  Data Objects and Attribute Types
  Basic Statistical Descriptions of Data
  Data Visualization
  Measuring Data Similarity and Dissimilarity
  Summary
Similarity and Dissimilarity
Similarity
  Numerical measure of how alike two data objects are
  Value is higher when objects are more alike
  Often falls in the range [0,1]
Dissimilarity (e.g., distance)
  Numerical measure of how different two data objects are
  Lower when objects are more alike
  Minimum dissimilarity is often 0
  Upper limit varies
Proximity refers to a similarity or dissimilarity

Data Matrix and Dissimilarity Matrix
Data matrix
  n data points with p dimensions
  Two modes
\[
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
\]
Dissimilarity matrix
  n data points, but registers only the distance
  A triangular matrix
  Single mode
\[
\begin{bmatrix}
0      &        &        &        &   \\
d(2,1) & 0      &        &        &   \\
d(3,1) & d(3,2) & 0      &        &   \\
\vdots & \vdots & \vdots &        &   \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
\]
Proximity Measure for Nominal Attributes
Can take 2 or more states, e.g., red, yellow, blue, green (generalization of a binary attribute)
Method 1: Simple matching
  m: # of matches, p: total # of variables
\[
d(i, j) = \frac{p - m}{p}
\]
Method 2: Use a large number of binary attributes
  creating a new binary attribute for each of the M nominal states

Proximity Measure for Binary Attributes
A contingency table for binary data (rows: Object i, columns: Object j)
Distance measure for symmetric binary variables:
Distance measure for asymmetric binary variables:
Jaccard coefficient (similarity measure for asymmetric binary variables):
Note: Jaccard coefficient is the same as “coherence”:
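The nominal and binary proximity measures can be sketched in a few functions. Simple matching follows the formula d(i, j) = (p − m) / p directly. For the binary case the sketch assumes the usual contingency-table notation, which the extracted text does not show: q counts attributes where both objects are 1, r where only object i is 1, s where only object j is 1, and t where both are 0. Under that notation the symmetric distance is (r + s) / (q + r + s + t), the asymmetric distance is (r + s) / (q + r + s), and the Jaccard coefficient is q / (q + r + s). The three attribute vectors are hypothetical.

```python
def nominal_distance(obj_i, obj_j):
    """Simple matching for nominal attributes: d(i, j) = (p - m) / p."""
    if len(obj_i) != len(obj_j):
        raise ValueError("objects must have the same number of attributes")
    p = len(obj_i)
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)
    return (p - m) / p


def binary_counts(obj_i, obj_j):
    """Return the contingency counts (q, r, s, t) for two 0/1 vectors."""
    q = r = s = t = 0
    for a, b in zip(obj_i, obj_j):
        if a == 1 and b == 1:
            q += 1
        elif a == 1 and b == 0:
            r += 1
        elif a == 0 and b == 1:
            s += 1
        else:
            t += 1
    return q, r, s, t


def symmetric_binary_distance(obj_i, obj_j):
    """(r + s) / (q + r + s + t): both outcomes carry equal weight."""
    q, r, s, t = binary_counts(obj_i, obj_j)
    return (r + s) / (q + r + s + t)


def asymmetric_binary_distance(obj_i, obj_j):
    """(r + s) / (q + r + s): joint absences (t) are ignored."""
    q, r, s, _ = binary_counts(obj_i, obj_j)
    if q + r + s == 0:
        raise ValueError("undefined when neither object has a 1")
    return (r + s) / (q + r + s)


def jaccard_similarity(obj_i, obj_j):
    """q / (q + r + s), i.e. 1 minus the asymmetric distance."""
    return 1.0 - asymmetric_binary_distance(obj_i, obj_j)


# Hypothetical asymmetric binary attribute vectors (1 = the important outcome).
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim = [1, 1, 0, 0, 0, 0]

print(round(asymmetric_binary_distance(jack, mary), 2))
print(round(asymmetric_binary_distance(jack, jim), 2))
print(round(asymmetric_binary_distance(jim, mary), 2))
print(nominal_distance(("red", "single", "IL"), ("red", "married", "IL")))
```

Method 2 for nominal attributes amounts to one-hot encoding each of the M states and then applying the binary measures to the resulting 0/1 vectors.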
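\[
d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33
\]
\[
d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67
\]
\[
d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75
\]

standardized measure (z-score):
\[
z_{if} = \frac{x_{if} - m_f}{s_f}
\]
where m_f is the mean of attribute f and s_f is its mean absolute deviation
Using mean absolute deviation is more robust than using standard deviation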
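The robustness claim for the mean absolute deviation can be illustrated with a small sketch. It assumes the definitions m_f = mean of the attribute and s_f = average of |x_if − m_f|; the function name `mad_z_scores` and the sample values are hypothetical. Because deviations are not squared, a single outlier inflates s_f less than it inflates the standard deviation, so the outlier keeps a larger z-score and stays easier to detect.

```python
import statistics


def mad_z_scores(values):
    """Standardize values as z = (x - m_f) / s_f with s_f the mean absolute deviation."""
    n = len(values)
    m_f = sum(values) / n
    s_f = sum(abs(x - m_f) for x in values) / n
    if s_f == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    return [(x - m_f) / s_f for x in values]


def std_z_scores(values):
    """Standardize values with the population standard deviation, for comparison."""
    m_f = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(x - m_f) / sigma for x in values]


# Hypothetical attribute values with one outlier (34.0).
values = [2.0, 4.0, 4.0, 6.0, 34.0]

z_mad = mad_z_scores(values)
z_std = std_z_scores(values)

print("outlier z-score, mean absolute deviation:", round(z_mad[-1], 3))
print("outlier z-score, standard deviation:", round(z_std[-1], 3))
```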
Example: Data Matrix and Dissimilarity Matrix
Data Matrix
  point   attribute1   attribute2
  x1      1            2
  x2      3            5
  x3      2            0
  x4      4            5
(Figure: the points x1, x2, x3, x4 plotted on axes ticked at 0, 2, 4)
Dissimilarity Matrix (with Euclidean Distance)
          x1   x2   x3   x4
  x1      0

Distance on Numeric Data: Minkowski Distance
Minkowski distance: A popular distance measure
\[
d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h}
\]
where i = (x_{i1}, x_{i2}, …, x_{ip}) and j = (x_{j1}, x_{j2}, …, x_{jp}) are two p-dimensional data objects, and h is the order (the distance so defined is also called L-h norm)
Properties
  d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
  d(i, j) = d(j, i) (Symmetry)
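The Minkowski definition and the four-point example can be combined in a short sketch. The function names `minkowski` and `dissimilarity_matrix` are illustrative; the points are the x1 to x4 rows of the data matrix, and h = 2 gives the Euclidean distance used for the dissimilarity matrix. Only the lower triangle is stored, matching the triangular form of a dissimilarity matrix.

```python
def minkowski(i, j, h):
    """Minkowski distance of order h between two equal-length numeric vectors."""
    if h < 1:
        raise ValueError("h must be at least 1 for a proper distance")
    if len(i) != len(j):
        raise ValueError("objects must have the same number of dimensions")
    return sum(abs(a - b) ** h for a, b in zip(i, j)) ** (1.0 / h)


def dissimilarity_matrix(points, h=2):
    """Lower-triangular matrix: row r holds d(r, 0), ..., d(r, r)."""
    return [
        [minkowski(points[r], points[c], h) for c in range(r + 1)]
        for r in range(len(points))
    ]


# The four points from the data matrix: x1, x2, x3, x4.
points = [(1, 2), (3, 5), (2, 0), (4, 5)]
names = ["x1", "x2", "x3", "x4"]

euclidean = dissimilarity_matrix(points, h=2)
for name, row in zip(names, euclidean):
    print(name, "  ".join(f"{d:.2f}" for d in row))

# h = 1 gives the Manhattan (city-block) distance.
manhattan = dissimilarity_matrix(points, h=1)
```

Symmetry means the upper triangle carries no extra information, which is why storing row r only up to column r is enough.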
Cosine Similarity
A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as keywords) or phrase in the document.

Example: Cosine Similarity
\[
\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}
\]
where \cdot indicates vector dot product, \|d\|: the length of vector d
Ex: Find the similarity between documents 1 and 2.
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
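The exercise can be worked through with a small function. This is a minimal sketch using only the standard library; the name `cosine_similarity` is illustrative, and the vectors are the two term-frequency vectors d1 and d2 given above.

```python
import math


def cosine_similarity(d1, d2):
    """cos(d1, d2) = (d1 . d2) / (||d1|| * ||d2||) for equal-length vectors."""
    if len(d1) != len(d2):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    if norm1 == 0 or norm2 == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm1 * norm2)


d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

similarity = cosine_similarity(d1, d2)
print(round(similarity, 2))  # 0.94
```

By hand: the dot product is 25, the lengths are √42 and √17, and 25 / √714 ≈ 0.94, so the two documents are quite similar.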
Summary