Week 2 - 3: Getting to Know Your Data
Warehousing
Data Visualization
What Are Data Sets?
Data sets are made up of data objects.
A data object represents an entity.
[Table: example data objects with the attributes Tid, Refund, Marital Status, Taxable Income, and Cheat]
Types:
Nominal
Binary
Numeric: quantitative
Interval-scaled
Ratio-scaled
Attribute Types
Nominal: categories, states, or “names of things”
Hair_color = {auburn, black, blond, brown, grey, red, white}
marital status, occupation, ID numbers, zip codes
Binary
Nominal attribute with only 2 states (0 and 1)
Symmetric binary: both outcomes equally important
e.g., gender
Asymmetric binary: outcomes not equally important.
e.g., medical test (positive vs. negative)
Convention: assign 1 to most important outcome (e.g., HIV
positive)
Ordinal
Values have a meaningful order (ranking) but magnitude between
successive values is not known.
Size = {small, medium, large}, grades
Numeric Attribute Types
Quantity (integer or real-valued)
Interval
Measured on a scale of equal-sized units
Values have order
E.g., temperature in °C or °F, calendar dates
Ratio
E.g., speed, counts (such as years_of_experience)
Discrete vs. Continuous Attributes
Discrete Attribute
Has only a finite or countably infinite set of values
collection of documents
Sometimes, represented as integer variables
attributes
Continuous Attribute
Has real numbers as attribute values
floating-point variables
Types of data sets
Record
Data Matrix
Document Data
Transaction Data
Graph
World Wide Web
Molecular Structures
Ordered
Spatial Data
Temporal Data
Sequential Data
Genetic Sequence Data
Record Data
Data that consists of a collection of records, each
of which consists of a fixed set of attributes
[Table: example records with the attributes Tid, Refund, Marital Status, Taxable Income, and Cheat]
Document Data
Each document becomes a term vector; the value of each component is the number of
times the corresponding term occurs in the document.
             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1      3      0     5     0      2     6    0     2        0       2
Document 2      0      7     0     2      1     0    0     3        0       0
Document 3      0      1     0     0      1     2    2     0        3       0
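As a rough illustration of how such a term vector could be built, the sketch below counts vocabulary terms in a piece of text. The vocabulary, the sample text, and the function name term_vector are assumptions made for this example; they are not taken from the slides.

```python
from collections import Counter

# Hypothetical vocabulary and document text, chosen only for illustration.
VOCABULARY = ["team", "coach", "play", "ball", "score"]


def term_vector(text, vocabulary):
    """Return how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


doc = "team play ball team score play team"
vector = term_vector(doc, VOCABULARY)
print(dict(zip(VOCABULARY, vector)))
```

Each position of the resulting list corresponds to one term, so several documents processed with the same vocabulary line up as rows of a document-term matrix.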
Transaction Data
A special type of record data, where
each record (transaction) involves a set of
items.
For example, consider a grocery store. The set
Data Quality
What kinds of data quality problems?
How can we detect problems with the data?
What can we do about these problems?
missing values
duplicate data
Noise
Noise refers to modification of original values
Examples: distortion of a person’s voice when
Examples:
Same person with multiple email addresses
Data cleaning
Process of dealing with duplicate data issues
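One possible way to surface candidate duplicates, such as the same person with several e-mail addresses, is to group records by a normalized name. The records and the normalization rule below are invented for illustration; real data cleaning usually needs fuzzier matching than this sketch.

```python
from collections import defaultdict

# Hypothetical records (name, e-mail); the data is invented for illustration.
records = [
    ("Ann Lee", "ann.lee@example.com"),
    ("Bob Roy", "bob@example.com"),
    ("ann  lee", "alee@example.org"),
]


def group_possible_duplicates(rows):
    """Group e-mail addresses by a lowercased, whitespace-normalized name."""
    groups = defaultdict(list)
    for name, email in rows:
        key = " ".join(name.lower().split())
        groups[key].append(email)
    # Keep only names that appear with more than one address.
    return {name: emails for name, emails in groups.items() if len(emails) > 1}


duplicates = group_possible_duplicates(records)
print(duplicates)
```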
Basic Statistical Descriptions of Data
Basic statistical descriptions can be used to identify properties of the data and
highlight which data values should be treated as noise or
outliers.
Measures of central tendency:
Basic Statistical Descriptions of Data
Many graphic displays of basic statistical
descriptions can be used to visually inspect the data.
Most statistical or graphical data presentation software
packages include bar charts, pie charts, and line graphs.
Other popular displays of data summaries and
distributions include quantile plots, quantile–quantile
plots, histograms, and scatter plots.
Measuring the Central Tendency
Mean:
Average value.
Median:
Middle value.
Mode:
Most common value.
Midrange:
The average of the largest and smallest values in the set.
This measure is easy to compute using the SQL aggregate functions max() and
min().
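The four measures above can be sketched with Python's standard statistics module. The sample values are illustrative and do not come from the slides; midrange is a small helper written for this example because the module has no such function.

```python
import statistics


def midrange(values):
    """Average of the largest and smallest values."""
    return (max(values) + min(values)) / 2


# Illustrative sample values (an assumption, not data from the slides).
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

summary = {
    "mean": statistics.mean(data),
    "median": statistics.median(data),
    "modes": statistics.multimode(data),  # every value tied for most common
    "midrange": midrange(data),
}
print(summary)
```

With this sample the single large value pulls the midrange well above the median, which shows how sensitive the midrange is to extreme values.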
Symmetric vs. Skewed Data
[Figure: median, mean, and mode of symmetric, positively skewed, and negatively skewed data]
Measuring the Dispersion of Data
Boxplot: the ends of the box are the quartiles; the median is marked; whiskers are added,
and outliers are plotted individually
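The ingredients of a boxplot can be computed without any plotting library. The sketch below assumes the common convention that whiskers reach the most extreme values within 1.5 x IQR of the quartiles and that anything beyond is an outlier; the slides do not state which rule they use, and the sample values are illustrative.

```python
import statistics


def boxplot_summary(values, whisker=1.5):
    """Quartiles, whisker ends, and outliers (1.5 x IQR rule is an assumption)."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }


# Illustrative sample values (an assumption, not data from the slides).
box = boxplot_summary([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])
print(box)
```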
Boxplot Analysis
Visualization of Data Dispersion: 3-D Boxplots
Histogram Analysis
Histograms Often Tell More than Boxplots
Quantile Plot
Displays all of the data (allowing the user to assess both
the overall behavior and unusual occurrences)
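A quantile plot pairs every sorted value with the fraction of the data at or below it. The sketch below uses the convention f_i = (i - 0.5) / n for the i-th smallest value; that formula is an assumption, since the slides do not give one, and the input values are illustrative.

```python
def quantile_plot_points(values):
    """Return (f, x) pairs with f = (i - 0.5) / n for the i-th smallest x."""
    data = sorted(values)
    n = len(data)
    return [((i - 0.5) / n, x) for i, x in enumerate(data, start=1)]


# Illustrative input; each pair would be one point on the quantile plot.
points = quantile_plot_points([40, 10, 30, 20])
print(points)
```

Plotting f on the horizontal axis and x on the vertical axis shows every observation, so unusual values stand out at either end of the curve.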
Scatter plot
Provides a first look at bivariate data to see clusters of
points, outliers, etc.
Each pair of values is treated as a pair of coordinates and
plotted as points in the plane
Positively and Negatively Correlated Data
Uncorrelated Data
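What a scatter plot shows as positively correlated, negatively correlated, or uncorrelated data can also be summarized by a number. The sketch below uses the Pearson correlation coefficient, which the slides do not name, so treat the choice of measure and the sample values as assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Illustrative values: y rises with x, falls with x, or shows no linear trend.
xs = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]
falling = [10, 8, 6, 4, 2]
no_trend = [3, 1, 4, 1, 3]

for label, ys in [("rising", rising), ("falling", falling), ("no trend", no_trend)]:
    print(label, round(pearson(xs, ys), 2))
```

Values near +1 correspond to an upward-sloping cloud of points, values near -1 to a downward-sloping one, and values near 0 to no linear pattern.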
Data Visualization
Why data visualization?
Gain insight into an information space by mapping data onto
graphical primitives
Provide qualitative overview of large data sets
among data
Help find interesting regions and suitable parameters for further
quantitative analysis
Provide a visual proof of computer representations derived
Geometric Projection Visualization Techniques
Landscapes
Used by permission of B. Wright, Visible Decisions Inc.
News articles visualized as a landscape
Parallel Coordinates of a Data Set
Icon-Based Visualization Techniques
Stick Figure
A census data figure showing age, income, gender, education, etc.
A 5-piece stick figure (1 body and 4 limbs with different angle/length)
Used by permission of G. Grinstein, University of Massachusetts at Lowell
Dimensional Stacking
Used by permission of M. Ward, Worcester Polytechnic Institute
Visualization of oil mining data with longitude and latitude mapped to the
outer x-, y-axes and ore grade and depth mapped to the inner x-, y-axes
Worlds-within-Worlds
Assign the function and two most important parameters to innermost
world
Fix all other parameters at constant values - draw other (1 or 2 or 3
dimensional worlds choosing these as the axes)
Software that uses this paradigm
N-vision: Dynamic interaction through data glove and stereo displays,
including rotation, scaling (inner) and translation (inner/outer)
Auto Visual: Static interaction by means of queries
Tree-Map
Screen-filling method which uses a hierarchical partitioning
of the screen into regions depending on the attribute
values
The x- and y-dimension of the screen are partitioned
alternately according to the attribute values (classes)
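The alternating partitioning can be sketched as a recursive layout that splits along x at one level and along y at the next. This is a minimal slice-and-dice sketch under stated assumptions: nodes are hypothetical (label, size, children) tuples, and each region is sized in proportion to the total size of its leaves, a sizing rule the slides do not specify.

```python
def total(node):
    """Sum of leaf sizes below a node; internal node sizes are ignored."""
    label, size, children = node
    return size if not children else sum(total(child) for child in children)


def layout(node, x, y, w, h, split_x=True):
    """Return (label, x, y, width, height) for each leaf, alternating the split axis."""
    label, size, children = node
    if not children:
        return [(label, x, y, w, h)]
    rects = []
    whole = total(node)
    offset = 0.0
    for child in children:
        share = total(child) / whole
        if split_x:
            rects += layout(child, x + offset * w, y, share * w, h, not split_x)
        else:
            rects += layout(child, x, y + offset * h, w, share * h, not split_x)
        offset += share
    return rects


# Hypothetical hierarchy: node = (label, size, children).
tree = ("root", 0, [
    ("A", 2, []),
    ("B", 0, [("B1", 1, []), ("B2", 1, [])]),
])
rects = layout(tree, 0, 0, 100, 100)
for rect in rects:
    print(rect)
```

The leaf rectangles tile the whole 100 x 100 screen, which is what makes the method screen-filling.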
Three-D Cone Trees
3D cone tree visualization technique works
well for up to a thousand nodes or so
First build a 2D circle tree that arranges its
nodes in concentric circles centered on the
root node
Cannot avoid overlaps when projected to
2D
G. Robertson, J. Mackinlay, S. Card. “Cone
Trees: Animated 3D Visualizations of
Hierarchical Information”, ACM SIGCHI'91
Graph from Nadeau Software Consulting
website: Visualize a social network data set
that models the way an infection spreads
from one person to the next
Visualizing Complex Data and Relations
Visualizing non-numerical data: text and social networks
Tag cloud: visualizing user-generated tags
The importance of a tag is represented by font size/color
Besides text data, there are also methods to visualize relationships,
such as visualizing social networks
Proximity Measure for Nominal Attributes
Proximity Measure for Binary Attributes
[Table: a contingency table for binary data, comparing Object i with Object j]
Dissimilarity between Binary Variables
Example
Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack  M       Y      N      P       N       N       N
Mary  F       Y      N      P       N       P       N
Jim   M       Y      P      N       N       N       N
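A sketch of how dissimilarities could be computed for the three patients in the table. It assumes the usual treatment of asymmetric binary attributes: Y and P are coded as 1, N as 0, Gender is left out as a symmetric attribute, and d = (r + s) / (q + r + s), where q counts attributes that are 1 for both objects and r and s count attributes that are 1 for only one of them. The slides' own formula did not survive extraction, so this coding and formula are assumptions.

```python
# Asymmetric attributes from the table: Fever, Cough, Test-1 .. Test-4.
# Assumption: Y and P are coded as 1, N as 0; Gender is excluded as symmetric.
records = {
    "Jack": ["Y", "N", "P", "N", "N", "N"],
    "Mary": ["Y", "N", "P", "N", "P", "N"],
    "Jim": ["Y", "P", "N", "N", "N", "N"],
}


def encode(values):
    """Map N to 0 and anything else (Y or P) to 1."""
    return [0 if v == "N" else 1 for v in values]


def asymmetric_binary_dissimilarity(a, b):
    """d = (r + s) / (q + r + s); joint absences (0, 0) are ignored."""
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    return (r + s) / (q + r + s)  # undefined if both objects are all zeros


coded = {name: encode(values) for name, values in records.items()}
for first, second in [("Jack", "Mary"), ("Jack", "Jim"), ("Jim", "Mary")]:
    d = asymmetric_binary_dissimilarity(coded[first], coded[second])
    print(first, second, round(d, 2))
```

Under these assumptions Jack and Mary come out as the most similar pair and Jim and Mary as the least similar.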
Standardizing Numeric Data
Z-score:
$z = \frac{x - \mu}{\sigma}$
x: raw score to be standardized, μ: mean of the population, σ: standard deviation
z is the distance between the raw score and the population mean in units of
the standard deviation
z is negative when the raw score is below the mean, positive when above
An alternative way: calculate the mean absolute deviation
$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \dots + |x_{nf} - m_f|\right)$
where
$m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \dots + x_{nf}\right)$
Standardized measure (z-score):
$z_{if} = \frac{x_{if} - m_f}{s_f}$
Using the mean absolute deviation is more robust than using the standard deviation
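Both standardizations can be sketched in a few lines. The sample values are illustrative, and the function names are chosen for this example; the first function divides by the population standard deviation, the second by the mean absolute deviation s_f.

```python
import statistics


def z_scores(values):
    """Standardize with the mean and the population standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(x - mu) / sigma for x in values]


def mad_z_scores(values):
    """Standardize with the mean m_f and the mean absolute deviation s_f."""
    n = len(values)
    m_f = sum(values) / n
    s_f = sum(abs(x - m_f) for x in values) / n
    return [(x - m_f) / s_f for x in values]


# Illustrative values (an assumption, not data from the slides).
data = [2, 4, 4, 4, 5, 5, 7, 9]
print([round(z, 2) for z in z_scores(data)])
print([round(z, 2) for z in mad_z_scores(data)])
```

Because deviations are not squared when computing s_f, extreme values inflate the divisor less, so their standardized scores stay further from zero and remain easier to spot.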
Example: Data Matrix and Dissimilarity Matrix
Data Matrix
point  attribute1  attribute2
x1     1           2
x2     3           5
x3     2           0
x4     4           5
[Figure: the points x1 to x4 plotted against attribute1 and attribute2]
Dissimilarity Matrix (with Euclidean Distance)
      x1    x2    x3    x4
x1    0
x2    3.61  0
x3    2.24  5.1   0
x4    4.24  1     5.39  0
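The dissimilarity matrix can be reproduced from the data matrix with a short sketch. It uses math.dist from the standard library (Python 3.8 or later) and keeps only the lower triangle, since the matrix is symmetric with zeros on the diagonal; the function name is chosen for this example.

```python
import math

# The four points from the data matrix: (attribute1, attribute2).
points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}


def dissimilarity_matrix(pts):
    """Lower-triangular matrix of Euclidean distances, rounded to 2 decimals."""
    names = list(pts)
    return [
        [round(math.dist(pts[a], pts[b]), 2) for b in names[: i + 1]]
        for i, a in enumerate(names)
    ]


matrix = dissimilarity_matrix(points)
for name, row in zip(points, matrix):
    print(name, row)
```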
Distance on Numeric Data: Minkowski Distance
Minkowski distance: A popular distance measure
$d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \dots + |x_{ip} - x_{jp}|^h}$
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two
p-dimensional data objects, and h is the order (the
distance so defined is also called the L-h norm)
Properties
d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
d(i, j) = d(j, i) (Symmetry)
d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
A distance that satisfies these properties is a metric
Special Cases of Minkowski Distance
h = 1: Manhattan (city block, L1 norm) distance
$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ip} - x_{jp}|$
E.g., the Hamming distance: the number of bits that are different
between two binary vectors
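A small sketch of why the Hamming distance is a Manhattan distance: for vectors of 0s and 1s, each absolute difference is 1 exactly where the bits differ. The two bit vectors are illustrative, not taken from the slides.

```python
def manhattan(a, b):
    """Sum of absolute coordinate differences (Minkowski distance with h = 1)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def hamming(a, b):
    """Number of positions at which two equally long vectors differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


# Illustrative binary vectors; they differ in the second and third positions.
u = [1, 0, 1, 1, 0]
v = [1, 1, 0, 1, 0]
print(manhattan(u, v), hamming(u, v))
```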
Example: Minkowski Distance
Dissimilarity Matrices
point  attribute 1  attribute 2
x1     1            2
x2     3            5
x3     2            0
x4     4            5
[Figure: the points x1 to x4 plotted against attribute 1 and attribute 2]
Manhattan (L1)
L1    x1    x2    x3    x4
x1    0
x2    5     0
x3    3     6     0
x4    6     1     7     0
Euclidean (L2)
L2    x1    x2    x3    x4
x1    0
x2    3.61  0
x3    2.24  5.1   0
x4    4.24  1     5.39  0
Supremum (L∞)
L∞    x1    x2    x3    x4
x1    0
x2    3     0
x3    2     5     0
x4    3     1     5     0
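The three matrices can be checked with one general function. This is a minimal sketch in which the supremum distance is treated as the limit of the Minkowski distance as h grows without bound, implemented as the largest coordinate difference; the function name and the use of math.inf to select that case are choices made for this example.

```python
import math

# The four points from the example: (attribute 1, attribute 2).
points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}


def minkowski(a, b, h):
    """Minkowski distance of order h; h = math.inf gives the supremum distance."""
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if math.isinf(h):
        return max(diffs)
    return sum(d ** h for d in diffs) ** (1 / h)


for label, h in [("Manhattan", 1), ("Euclidean", 2), ("Supremum", math.inf)]:
    d12 = minkowski(points["x1"], points["x2"], h)
    d34 = minkowski(points["x3"], points["x4"], h)
    print(label, round(d12, 2), round(d34, 2))
```

For x1 and x2 this gives 5, 3.61, and 3, matching the corresponding entries of the three matrices.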
Videos
Box-Plot Case Study:
https://www.youtube.com/watch?v=CoVf1jLxgj4
Z-Score Case Study:
https://www.youtube.com/watch?v=2JjaWQZChqs
Jaccard Coefficient:
https://www.youtube.com/watch?v=Vbdki_gnnYM