02 Data

This document provides an overview of key concepts in data mining and data analysis. It discusses different types of data sets and attributes, including attribute types like nominal, binary, numeric, discrete vs continuous. It also covers basic statistical descriptions of data, including measuring central tendency, dispersion characteristics, and data visualization. The goal is to help readers get familiar with their data through understanding objects, attributes, and basic statistical analysis.


Data Mining:

Concepts and Techniques

— Chapter 2 —

Slides courtesy of the textbook

1
Chapter 2: Getting to Know Your Data

• Data Objects and Attribute Types
• Basic Statistical Descriptions of Data
• Data Visualization
• Measuring Data Similarity and Dissimilarity
• Summary

2
Types of Data Sets
• Record
  • Relational records
  • Data matrix, e.g., numerical matrix, crosstabs
  • Document data: text documents represented as term-frequency vectors, e.g.:

    Document      team  coach  play  ball  score  game  win  lost  timeout  season
    Document 1      3     0     5     0      2     6     0     2      0        2
    Document 2      0     7     0     2      1     0     0     3      0        0
    Document 3      0     1     0     0      1     2     2     0      3        0

  • Transaction data, e.g.:

    TID  Items
    1    Bread, Coke, Milk
    2    Beer, Bread
    3    Beer, Coke, Diaper, Milk
    4    Beer, Bread, Diaper, Milk
    5    Coke, Diaper, Milk

• Graph and network
  • World Wide Web
  • Social or information networks
  • Molecular structures
• Ordered
  • Video data: sequence of images
  • Temporal data: time-series
  • Sequential data: transaction sequences
  • Genetic sequence data
• Spatial, image and multimedia
  • Spatial data: maps
  • Image data
  • Video data
3
Important Characteristics of Structured Data

• Dimensionality
  • Curse of dimensionality
• Sparsity
  • Only presence counts
• Resolution
  • Patterns depend on the scale
• Distribution
  • Centrality and dispersion

4
Data Objects

• Data sets are made up of data objects.
• A data object represents an entity.
• Examples:
  • sales database: customers, store items, sales
  • medical database: patients, treatments
  • university database: students, professors, courses
• Also called samples, examples, instances, data points, objects, tuples.
• Data objects are described by attributes.
• Database rows → data objects; columns → attributes.
5
Attributes
• Attribute (or dimension, feature, variable): a data field representing a characteristic or feature of a data object.
  • E.g., customer_ID, name, address
• Types:
  • Nominal
  • Binary
  • Ordinal
  • Numeric: quantitative
    • Interval-scaled
    • Ratio-scaled

6
Attribute Types
• Nominal: categories, states, or "names of things"
  • Hair_color = {auburn, black, blond, brown, grey, red, white}
  • marital status, occupation, ID numbers, zip codes
• Binary
  • Nominal attribute with only 2 states (0 and 1)
  • Symmetric binary: both outcomes equally important
    • e.g., gender
  • Asymmetric binary: outcomes not equally important
    • e.g., medical test (positive vs. negative)
    • Convention: assign 1 to the more important outcome (e.g., HIV positive)
• Ordinal
  • Values have a meaningful order (ranking), but the magnitude between successive values is not known
  • Size = {small, medium, large}, grades, army rankings

7
Numeric Attribute Types
• Quantity (integer or real-valued)
• Interval
  • Measured on a scale of equal-sized units
  • Values have order
    • E.g., temperature in C˚ or F˚, calendar dates
  • No true zero-point
• Ratio
  • Inherent zero-point
  • We can speak of values as being an order of magnitude larger than the unit of measurement (10 K˚ is twice as high as 5 K˚)
    • e.g., temperature in Kelvin, length, counts, monetary quantities
8
Discrete vs. Continuous Attributes
• Discrete attribute
  • Has only a finite or countably infinite set of values
    • E.g., zip codes, profession, or the set of words in a collection of documents
  • Sometimes represented as integer variables
  • Note: binary attributes are a special case of discrete attributes
• Continuous attribute
  • Has real numbers as attribute values
    • E.g., temperature, height, or weight
  • Practically, real values can only be measured and represented using a finite number of digits
  • Continuous attributes are typically represented as floating-point variables
9
Chapter 2: Getting to Know Your Data

• Data Objects and Attribute Types
• Basic Statistical Descriptions of Data
• Data Visualization
• Measuring Data Similarity and Dissimilarity
• Summary

10
Basic Statistical Descriptions of Data
• Motivation
  • To better understand the data: central tendency, variation, and spread
• Data dispersion characteristics
  • median, max, min, quantiles, outliers, variance, etc.
• Numerical dimensions correspond to sorted intervals
  • Data dispersion: analyzed with multiple granularities of precision
  • Boxplot or quantile analysis on sorted intervals
• Dispersion analysis on computed measures
  • Folding measures into numerical dimensions
  • Boxplot or quantile analysis on the transformed cube
11
Measuring the Central Tendency
• Mean (algebraic measure) (sample vs. population), where n is the sample size and N the population size:
  $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$   $\mu = \frac{\sum x}{N}$
• Weighted arithmetic mean:
  $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
• Trimmed mean: chopping extreme values
• Median
  • Middle value if odd number of values, or average of the middle two values otherwise
  • Estimated by interpolation (for grouped data):
    $\text{median} = L_1 + \left(\frac{n/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}}\right) \cdot \text{width}$
• Mode
  • Value that occurs most frequently in the data
  • Unimodal, bimodal, trimodal
  • Empirical formula: $\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$
12
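As a quick illustration, the measures above can be computed with Python's statistics module. This is a sketch; the salary-like values below are made up for illustration:

```python
import statistics

# Hypothetical salary data (in $1000s); values are illustrative only
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

mean = statistics.mean(data)        # 696 / 12 = 58
median = statistics.median(data)    # n is even: average of the two middle values, (52 + 56) / 2 = 54
modes = statistics.multimode(data)  # bimodal here: [52, 70]

# Trimmed mean: chop the extreme values (here: one from each end) before averaging
trimmed = statistics.mean(sorted(data)[1:-1])  # 556 / 10 = 55.6
```

Note that the mean (58) exceeds the median (54), consistent with the positively skewed shape discussed on the next slide.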
Symmetric vs. Skewed Data
• Median, mean, and mode of symmetric, positively skewed, and negatively skewed data

August 26, 2015   Data Mining: Concepts and Techniques   13

Measuring the Dispersion of Data
• Quartiles, outliers, and boxplots
  • Quartiles: Q1 (25th percentile), Q3 (75th percentile)
  • Inter-quartile range: IQR = Q3 − Q1
  • Five-number summary: min, Q1, median, Q3, max
  • Boxplot: ends of the box are the quartiles; the median is marked; add whiskers, and plot outliers individually
  • Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
• Variance and standard deviation (sample: s; population: σ)
  • Variance (algebraic, scalable computation):
    $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$
    $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$
  • Standard deviation s (or σ) is the square root of the variance s² (or σ²)
14
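A small sketch of sample vs. population variance, including the "scalable" one-pass form that needs only the sum and the sum of squares (the values are made up for illustration):

```python
import statistics

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

s2 = statistics.variance(data)       # sample variance: divides by n - 1
sigma2 = statistics.pvariance(data)  # population variance: divides by N
s = statistics.stdev(data)           # sample standard deviation, sqrt(s2)

# Scalable form: same s2 from running sums, without storing deviations
n = len(data)
s2_scalable = (sum(x * x for x in data) - sum(data) ** 2 / n) / (n - 1)
```

The scalable form matters for large data sets: the two sums can be accumulated in a single pass without a second pass over the data to subtract the mean.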
Boxplot Analysis

• Five-number summary of a distribution
  • Minimum, Q1, Median, Q3, Maximum
• Boxplot
  • Data is represented with a box
  • The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR
  • The median is marked by a line within the box
  • Whiskers: two lines outside the box, extended to the minimum and maximum
  • Outliers: points beyond a specified outlier threshold, plotted individually

15
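The five-number summary and the 1.5 × IQR outlier rule can be sketched as follows. Note that several quartile conventions exist, so the exact Q1/Q3 values may differ slightly from hand calculations; `statistics.quantiles` defaults to the "exclusive" method. The data values are made up:

```python
import statistics

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

q1, q2, q3 = statistics.quantiles(data, n=4)  # quartiles ("exclusive" method by default)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr    # whisker/outlier fences

five_number = (min(data), q1, q2, q3, max(data))
outliers = [x for x in data if x < low or x > high]  # plotted individually in a boxplot
```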
Visualization of Data Dispersion: 3-D Boxplots

16


Properties of the Normal Distribution Curve

• The normal (distribution) curve
  • From μ−σ to μ+σ: contains about 68% of the measurements (μ: mean, σ: standard deviation)
  • From μ−2σ to μ+2σ: contains about 95% of it
  • From μ−3σ to μ+3σ: contains about 99.7% of it

17
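These coverage figures (the 68-95-99.7 rule) follow from the Gaussian CDF, which can be checked directly with `math.erf` (a sketch):

```python
import math

def normal_coverage(k: float) -> float:
    """P(mu - k*sigma < X < mu + k*sigma) for a normal distribution.

    For N(mu, sigma), this equals erf(k / sqrt(2)), independent of mu and sigma.
    """
    return math.erf(k / math.sqrt(2))

print(round(normal_coverage(1), 4))  # ~0.6827
print(round(normal_coverage(2), 4))  # ~0.9545
print(round(normal_coverage(3), 4))  # ~0.9973
```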
Graphic Displays of Basic Statistical Descriptions

• Boxplot: graphic display of the five-number summary
• Histogram: the x-axis shows values, the y-axis represents frequencies
• Quantile plot: each value x_i is paired with f_i, indicating that approximately 100·f_i % of the data are ≤ x_i
• Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
• Scatter plot: each pair of values is a pair of coordinates, plotted as points in the plane

18
Histogram Analysis
• Histogram: graph display of tabulated frequencies, shown as bars
• It shows what proportion of cases fall into each of several categories
• Differs from a bar chart in that it is the area of the bar that denotes the value, not the height as in bar charts; a crucial distinction when the categories are not of uniform width
• The categories are usually specified as non-overlapping intervals of some variable. The categories (bars) must be adjacent
19
Histograms Often Tell More than Boxplots

• The two histograms shown on the left may have the same boxplot representation
  • The same values for: min, Q1, median, Q3, max
• But they have rather different data distributions

20
Quantile Plot
• Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences)
• Plots quantile information
  • For data x_i sorted in increasing order, f_i indicates that approximately 100·f_i % of the data are below or equal to the value x_i

21

Quantile-Quantile (Q-Q) Plot
• Graphs the quantiles of one univariate distribution against the corresponding quantiles of another
• View: is there a shift in going from one distribution to another?
• Example shows the unit price of items sold at Branch 1 vs. Branch 2 for each quantile. Unit prices of items sold at Branch 1 tend to be lower than those at Branch 2.

22
Scatter Plot
• Provides a first look at bivariate data to see clusters of points, outliers, etc.
• Each pair of values is treated as a pair of coordinates and plotted as points in the plane

23
Positively and Negatively Correlated Data

• The left half fragment is positively correlated
• The right half is negatively correlated

24
Uncorrelated Data

25
Chapter 2: Getting to Know Your Data

• Data Objects and Attribute Types
• Basic Statistical Descriptions of Data
• Data Visualization
• Measuring Data Similarity and Dissimilarity
• Summary

26
Data Visualization
• Why data visualization?
  • Gain insight into an information space by mapping data onto graphical primitives
  • Provide a qualitative overview of large data sets
  • Search for patterns, trends, structure, irregularities, and relationships among data
  • Help find interesting regions and suitable parameters for further quantitative analysis
  • Provide visual proof of computer representations derived
• Categorization of visualization methods:
  • Pixel-oriented visualization techniques
  • Geometric projection visualization techniques
  • Icon-based visualization techniques
  • Hierarchical visualization techniques
  • Visualizing complex data and relations

27
Pixel-Oriented Visualization Techniques
• For a data set of m dimensions, create m windows on the screen, one for each dimension
• The m dimension values of a record are mapped to m pixels at the corresponding positions in the windows
• The colors of the pixels reflect the corresponding values

(a) Income  (b) Credit limit  (c) Transaction volume  (d) Age
28
Laying Out Pixels in Circle Segments
• To save space and show the connections among multiple dimensions, space filling is often done in a circle segment

(a) Representing a data record in a circle segment  (b) Laying out pixels in a circle segment
Representing about 265,000 50-dimensional data items with the 'Circle Segments' technique
29
Geometric Projection Visualization Techniques

• Visualization of geometric transformations and projections of the data
• Methods
  • Direct visualization
  • Scatterplot and scatterplot matrices
  • Landscapes
  • Projection pursuit technique: help users find meaningful projections of multidimensional data
  • Prosection views
  • Hyperslice
  • Parallel coordinates
30
Direct Data Visualization
Ribbons with twists based on vorticity

31


Scatterplot Matrices
Used by permission of M. Ward, Worcester Polytechnic Institute

Matrix of scatterplots (x-y diagrams) of the k-dim. data [total of (k² − k)/2 scatterplots]

32
Landscapes
Used by permission of B. Wright, Visible Decisions Inc.

(News articles visualized as a landscape)

• Visualization of the data as a perspective landscape
• The data needs to be transformed into a (possibly artificial) 2D spatial representation which preserves the characteristics of the data
33
Parallel Coordinates
• n equidistant axes which are parallel to one of the screen axes and correspond to the attributes
• The axes are scaled to the [minimum, maximum] range of the corresponding attribute
• Every data item corresponds to a polygonal line which intersects each of the axes at the point corresponding to the value of that attribute

Attr. 1   Attr. 2   Attr. 3   ...   Attr. k

34
Parallel Coordinates of a Data Set

35
Icon-Based Visualization Techniques

• Visualization of the data values as features of icons
• Typical visualization methods
  • Chernoff faces
  • Stick figures
• General techniques
  • Shape coding: use shape to represent certain information encoding
  • Color icons: use color icons to encode more information
  • Tile bars: use small icons to represent the relevant feature vectors in document retrieval

36
Chernoff Faces

• A way to display variables on a two-dimensional surface, e.g., let x be eyebrow slant, y be eye size, z be nose length, etc.
• The figure shows faces produced using 10 characteristics (head eccentricity, eye size, eye spacing, eye eccentricity, pupil size, eyebrow slant, nose size, mouth shape, mouth size, and mouth opening): each assigned one of 10 possible values, generated using Mathematica (S. Dickson)
• REFERENCE: Gonick, L. and Smith, W. The Cartoon Guide to Statistics. New York: Harper Perennial, p. 212, 1993
• Weisstein, Eric W. "Chernoff Face." From MathWorld--A Wolfram Web Resource. mathworld.wolfram.com/ChernoffFace.html

37
Stick Figure
• A census data figure showing age, income, gender, education, etc.
• A 5-piece stick figure (1 body and 4 limbs with different angle/length)

38


Hierarchical Visualization Techniques

• Visualization of the data using a hierarchical partitioning into subspaces
• Methods
  • Dimensional stacking
  • Worlds-within-Worlds
  • Tree-Map
  • Cone trees
  • InfoCube

39
Dimensional Stacking

• Partitioning of the n-dimensional attribute space into 2-D subspaces, which are 'stacked' into each other
• Partitioning of the attribute value ranges into classes. The important attributes should be used on the outer levels.
• Adequate for data with ordinal attributes of low cardinality
• But difficult to display more than nine dimensions
• Important to map dimensions appropriately
40
Dimensional Stacking
Used  by  permission  of  M.  Ward,  Worcester  Polytechnic  Institute

Visualization  of  oil  mining  data  with  longitude  and  latitude  mapped  to  the  
outer  x-­‐,  y-­‐axes  and  ore  grade  and  depth  mapped  to  the  inner  x-­‐,  y-­‐axes
41
Worlds-within-Worlds
• Assign the function and the two most important parameters to the innermost world
• Fix all other parameters at constant values; draw other (1-, 2-, or 3-dimensional) worlds choosing these as the axes
• Software that uses this paradigm:
  • N-vision: dynamic interaction through a data glove and stereo displays, including rotation, scaling (inner), and translation (inner/outer)
  • Auto Visual: static interaction by means of queries

42
Tree-Map
• Screen-filling method which uses a hierarchical partitioning of the screen into regions depending on the attribute values
• The x- and y-dimensions of the screen are partitioned alternately according to the attribute values (classes)

Schneiderman@UMD: Tree-Map of a file system; Tree-Map to support large data sets of a million items
43
InfoCube
• A 3-D visualization technique where hierarchical information is displayed as nested semi-transparent cubes
• The outermost cubes correspond to the top-level data, while subnodes (the lower-level data) are represented as smaller cubes inside the outermost cubes, and so on

44
Three-D Cone Trees
• The 3D cone tree visualization technique works well for up to a thousand nodes or so
• First build a 2D circle tree that arranges its nodes in concentric circles centered on the root node
• Cannot avoid overlaps when projected to 2D
• G. Robertson, J. Mackinlay, S. Card. "Cone Trees: Animated 3D Visualizations of Hierarchical Information", ACM SIGCHI'91
• Graph from Nadeau Software Consulting website: visualize a social network data set that models the way an infection spreads from one person to the next

45
Visualizing Complex Data and Relations
• Visualizing non-numerical data: text and social networks
• Tag cloud: visualizing user-generated tags
  • The importance of a tag is represented by font size/color
• Besides text data, there are also methods to visualize relationships, such as visualizing social networks

Newsmap: Google News stories in 2005

Chapter 2: Getting to Know Your Data

• Data Objects and Attribute Types
• Basic Statistical Descriptions of Data
• Data Visualization
• Measuring Data Similarity and Dissimilarity
• Summary

47
Similarity and Dissimilarity
• Similarity
  • Numerical measure of how alike two data objects are
  • Value is higher when objects are more alike
  • Often falls in the range [0, 1]
• Dissimilarity (e.g., distance)
  • Numerical measure of how different two data objects are
  • Lower when objects are more alike
  • Minimum dissimilarity is often 0
  • Upper limit varies
• Proximity refers to a similarity or dissimilarity
48
Data Matrix and Dissimilarity Matrix
• Data matrix
  • n data points with p dimensions
  • Two modes
    $\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$
• Dissimilarity matrix
  • n data points, but registers only the distance
  • A triangular matrix
  • Single mode
    $\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$

49
Proximity Measure for Nominal Attributes

• Can take 2 or more states, e.g., red, yellow, blue, green (generalization of a binary attribute)
• Method 1: Simple matching
  • m: # of matches, p: total # of variables
    $d(i,j) = \frac{p - m}{p}$
• Method 2: Use a large number of binary attributes
  • creating a new binary attribute for each of the M nominal states

50
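Simple matching can be sketched in a few lines of Python. The three-attribute objects below are hypothetical, chosen only to illustrate the formula:

```python
def nominal_dissim(i, j):
    """Simple matching dissimilarity: d(i, j) = (p - m) / p,
    where p is the number of attributes and m the number of matches."""
    assert len(i) == len(j)
    p = len(i)
    m = sum(a == b for a, b in zip(i, j))
    return (p - m) / p

# Two hypothetical objects described by three nominal attributes;
# they match on 2 of 3 attributes, so d = (3 - 2) / 3 = 1/3
d = nominal_dissim(("red", "single", "teacher"),
                   ("red", "married", "teacher"))
```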
Proximity Measure for Binary Attributes
• A contingency table for binary data (q: both objects 1; r: i = 1, j = 0; s: i = 0, j = 1; t: both 0):

               Object j
                1    0
  Object i  1   q    r
            0   s    t

• Distance measure for symmetric binary variables:
  $d(i,j) = \frac{r + s}{q + r + s + t}$
• Distance measure for asymmetric binary variables (0-0 matches are ignored):
  $d(i,j) = \frac{r + s}{q + r + s}$
• Jaccard coefficient (similarity measure for asymmetric binary variables):
  $\text{sim}_{\text{Jaccard}}(i,j) = \frac{q}{q + r + s}$
• Note: the Jaccard coefficient is the same as "coherence"

51
Dissimilarity between Binary Variables
• Example

  Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
  Jack    M       Y      N      P       N       N       N
  Mary    F       Y      N      P       N       P       N
  Jim     M       Y      P      N       N       N       N

• Gender is a symmetric attribute
• The remaining attributes are asymmetric binary
• Let the values Y and P be 1, and the value N be 0

  $d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33$
  $d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67$
  $d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75$
52
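The example can be reproduced in code. Following the slide, gender is excluded (it is symmetric), and the asymmetric attributes Fever, Cough, Test-1..Test-4 are encoded with Y/P as 1 and N as 0:

```python
def asym_binary_dissim(x, y):
    """Asymmetric binary dissimilarity d = (r + s) / (q + r + s);
    0-0 matches (t) carry no information and are ignored."""
    q = sum(a == 1 and b == 1 for a, b in zip(x, y))  # 1-1 matches
    r = sum(a == 1 and b == 0 for a, b in zip(x, y))  # x=1, y=0
    s = sum(a == 0 and b == 1 for a, b in zip(x, y))  # x=0, y=1
    return (r + s) / (q + r + s)

# (Fever, Cough, Test-1, Test-2, Test-3, Test-4), Y/P -> 1, N -> 0
jack = (1, 0, 1, 0, 0, 0)
mary = (1, 0, 1, 0, 1, 0)
jim  = (1, 1, 0, 0, 0, 0)

print(round(asym_binary_dissim(jack, mary), 2))  # 0.33
print(round(asym_binary_dissim(jack, jim), 2))   # 0.67
print(round(asym_binary_dissim(jim, mary), 2))   # 0.75
```

Jim and Mary come out most dissimilar, since they share only the Fever symptom.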
Standardizing Numeric Data

• Z-score: $z = \frac{x - \mu}{\sigma}$
  • x: raw score to be standardized; μ: mean of the population; σ: standard deviation
  • the distance between the raw score and the population mean in units of the standard deviation
  • negative when the raw score is below the mean, positive when above
• An alternative way: calculate the mean absolute deviation
  $s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$
  where $m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$
  • standardized measure (z-score): $z_{if} = \frac{x_{if} - m_f}{s_f}$
• Using the mean absolute deviation is more robust than using the standard deviation

53
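The mean-absolute-deviation variant of the z-score can be sketched as follows (the sample values are made up):

```python
def z_scores(xs):
    """Standardize using the mean absolute deviation s_f instead of the
    standard deviation: deviations are not squared, so outliers pull
    s_f up less, making the measure more robust."""
    n = len(xs)
    m = sum(xs) / n                          # m_f: mean of the attribute
    s = sum(abs(x - m) for x in xs) / n      # s_f: mean absolute deviation
    return [(x - m) / s for x in xs]

# Made-up values with one large observation
zs = z_scores([1, 2, 3, 4, 10])  # mean 4, mean abs. deviation 2.4
```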
Example: Data Matrix and Dissimilarity Matrix

Data Matrix

  point  attribute1  attribute2
  x1        1           2
  x2        3           5
  x3        2           0
  x4        4           5

Dissimilarity Matrix (with Euclidean distance)

        x1     x2    x3    x4
  x1    0
  x2    3.61   0
  x3    2.24   5.1   0
  x4    4.24   1     5.39  0

54
Distance on Numeric Data: Minkowski Distance
• Minkowski distance: a popular distance measure
  $d(i,j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h}$
  where i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp) are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm)
• Properties
  • d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (positive definiteness)
  • d(i, j) = d(j, i) (symmetry)
  • d(i, j) ≤ d(i, k) + d(k, j) (triangle inequality)
• A distance that satisfies these properties is a metric
55
Special Cases of Minkowski Distance
• h = 1: Manhattan (city block, L1 norm) distance
  • E.g., the Hamming distance: the number of bits that are different between two binary vectors
  $d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$
• h = 2: (L2 norm) Euclidean distance
  $d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$
• h → ∞: "supremum" (Lmax norm, L∞ norm) distance
  • This is the maximum difference between any component (attribute) of the vectors: $d(i,j) = \max_f |x_{if} - x_{jf}|$

56
Example: Minkowski Distance

  point  attribute 1  attribute 2
  x1        1            2
  x2        3            5
  x3        2            0
  x4        4            5

Dissimilarity Matrices

Manhattan (L1)
  L1    x1   x2   x3   x4
  x1    0
  x2    5    0
  x3    3    6    0
  x4    6    1    7    0

Euclidean (L2)
  L2    x1     x2    x3    x4
  x1    0
  x2    3.61   0
  x3    2.24   5.1   0
  x4    4.24   1     5.39  0

Supremum (L∞)
  L∞    x1   x2   x3   x4
  x1    0
  x2    3    0
  x3    2    5    0
  x4    3    1    5    0
57
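A minimal sketch that reproduces entries of the three matrices for the points x1..x4 from the example:

```python
def minkowski(x, y, h):
    """Minkowski (L-h norm) distance between two equal-length vectors."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)

def supremum(x, y):
    """L-infinity distance: the limit of minkowski(x, y, h) as h -> infinity."""
    return max(abs(a - b) for a, b in zip(x, y))

x1, x2, x3, x4 = (1, 2), (3, 5), (2, 0), (4, 5)

print(minkowski(x1, x2, 1))            # Manhattan: 5.0
print(round(minkowski(x1, x2, 2), 2))  # Euclidean: 3.61
print(supremum(x1, x2))                # Supremum: 3
```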
Ordinal Variables

• An ordinal variable can be discrete or continuous
• Order is important, e.g., rank
• Can be treated like interval-scaled
  • replace $x_{if}$ by its rank $r_{if} \in \{1, \ldots, M_f\}$
  • map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
    $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
  • compute the dissimilarity using methods for interval-scaled variables

58
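The rank-to-[0, 1] mapping can be sketched as follows, using the Size = {small, medium, large} example from earlier (the sample values are made up):

```python
def ordinal_to_interval(values, order):
    """Map ordinal values onto [0, 1] via z_if = (r_if - 1) / (M_f - 1),
    where ranks r_if in {1, ..., M_f} follow the given state order."""
    rank = {v: r for r, v in enumerate(order, start=1)}
    m = len(order)  # M_f: number of ordered states
    return [(rank[v] - 1) / (m - 1) for v in values]

sizes = ["small", "large", "medium", "small"]
print(ordinal_to_interval(sizes, ["small", "medium", "large"]))
# [0.0, 1.0, 0.5, 0.0]
```

The resulting z values can then be fed into any of the numeric distance measures above.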
Attributes of Mixed Type
• A database may contain all attribute types
  • Nominal, symmetric binary, asymmetric binary, numeric, ordinal
• One may use a weighted formula to combine their effects:
  $d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$
  • f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
  • f is numeric: use the normalized distance
  • f is ordinal
    • Compute ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
    • Treat $z_{if}$ as interval-scaled
59
Cosine Similarity
• A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as keywords) or phrase in the document.
• Other vector objects: gene features in micro-arrays, ...
• Applications: information retrieval, biological taxonomy, gene feature mapping, ...
• Cosine measure: if d1 and d2 are two vectors (e.g., term-frequency vectors), then
  cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||),
  where • indicates the vector dot product and ||d|| is the length of vector d

60
Example: Cosine Similarity
• cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||), where • indicates the vector dot product and ||d|| is the length of vector d
• Ex: find the similarity between documents 1 and 2.

  d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
  d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

  d1 • d2 = 5×3 + 0×0 + 3×2 + 0×0 + 2×1 + 0×1 + 0×0 + 2×1 + 0×0 + 0×1 = 25
  ||d1|| = (5² + 0² + 3² + 0² + 2² + 0² + 0² + 2² + 0² + 0²)^0.5 = 42^0.5 ≈ 6.481
  ||d2|| = (3² + 0² + 2² + 0² + 1² + 1² + 0² + 1² + 0² + 1²)^0.5 = 17^0.5 ≈ 4.123
  cos(d1, d2) ≈ 0.94

61
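The worked example above can be sketched directly in Python:

```python
import math

def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (||a|| ||b||) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Term-frequency vectors from the example
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

print(round(cosine_similarity(d1, d2), 2))  # 0.94
```

Because term-frequency vectors are non-negative, the result lies in [0, 1]; a value near 1 means the two documents use words in very similar proportions regardless of document length.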
Chapter 2: Getting to Know Your Data

• Data Objects and Attribute Types
• Basic Statistical Descriptions of Data
• Data Visualization
• Measuring Data Similarity and Dissimilarity
• Summary

62
Summary
• Data attribute types: nominal, binary, ordinal, interval-scaled, ratio-scaled
• Many types of data sets, e.g., numerical, text, graph, Web, image
• Gain insight into the data by:
  • Basic statistical data description: central tendency, dispersion, graphical displays
  • Data visualization: map data onto graphical primitives
  • Measuring data similarity
• The above steps are the beginning of data preprocessing
• Many methods have been developed, but this is still an active area of research
References
• W. Cleveland. Visualizing Data. Hobart Press, 1993
• T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley, 2003
• U. Fayyad, G. Grinstein, and A. Wierse. Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann, 2001
• L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990
• H. V. Jagadish et al. Special Issue on Data Reduction Techniques. Bulletin of the Tech. Committee on Data Eng., 20(4), Dec. 1997
• D. A. Keim. Information visualization and visual data mining. IEEE Trans. on Visualization and Computer Graphics, 8(1), 2002
• D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999
• S. Santini and R. Jain. "Similarity measures". IEEE Trans. on Pattern Analysis and Machine Intelligence, 21(9), 1999
• E. R. Tufte. The Visual Display of Quantitative Information, 2nd ed. Graphics Press, 2001
• C. Yu et al. Visual data mining of multimedia data for social and behavioral studies. Information Visualization, 8(1), 2009
