
Data Mining

Chapter 2: Data Preprocessing


Basanta Joshi, PhD
Asst. Prof., Department of Electronics and Computer Engineering
Program Coordinator, MSc in Information and Communication Engineering
Member, Laboratory for ICT Research and Development (LICT)
Institute of Engineering
[email protected]
http://www.basantajoshi.com.np

Data Mining 1
Outline

• Attributes and Objects

• Types of Data

• Data Quality

• Data Pre-processing

• Various Similarity Measures

• OLAP & Multidimensional Data Analysis

What is Data?

• Collection of data objects and their attributes

• An attribute is a property or characteristic of an object
  – Examples: eye color of a person, temperature, etc.
  – Attribute is also known as variable, field, characteristic, dimension, or feature

• A collection of attributes describes an object
  – Object is also known as record, point, case, sample, entity, or instance

  Sample data set (rows = objects, columns = attributes):

  Tid  Refund  Marital Status  Taxable Income  Cheat
   1   Yes     Single          125K            No
   2   No      Married         100K            No
   3   No      Single           70K            No
   4   Yes     Married         120K            No
   5   No      Divorced         95K            Yes
   6   No      Married          60K            No
   7   Yes     Divorced        220K            No
   8   No      Single           85K            Yes
   9   No      Married          75K            No
  10   No      Single           90K            Yes
A More Complete View of Data

• Data may have parts

• The different parts of the data may have relationships

• More generally, data may have structure

• Data can be incomplete

Attribute Values
• Attribute values are numbers or symbols assigned to an attribute for
a particular object

• Distinction between attributes and attribute values


– Same attribute can be mapped to different attribute values
• Example: height can be measured in feet or meters

– Different attributes can be mapped to the same set of values


• Example: Attribute values for ID and age are integers
• But properties of attribute values can be different

Measurement of Length

• The way you measure an attribute may not match the attribute's properties.

[Figure: five line segments (A–E) mapped to numbers by two different scales. One scale (values 1, 2, 3, 4, 5) preserves only the ordering property of length; the other (values 5, 7, 8, 10, 15) preserves both the ordering and additivity properties of length.]
Types of Attributes

• There are different types of attributes


–Nominal
• Examples: ID numbers, eye color, zip codes
–Ordinal
• Examples: rankings (e.g., taste of potato chips on a
scale from 1-10), grades, height {tall, medium,
short}
–Interval
• Examples: calendar dates, temperatures in Celsius
or Fahrenheit.
–Ratio
• Examples: temperature in Kelvin, length, time,
counts
Properties of Attribute Values

• The type of an attribute depends on which of the following properties/operations it possesses:
  – Distinctness: =, ≠
  – Order: <, >
  – Differences are meaningful: +, -
  – Ratios are meaningful: *, /

• Accordingly:
  – Nominal attribute: distinctness
  – Ordinal attribute: distinctness & order
  – Interval attribute: distinctness, order & meaningful differences
  – Ratio attribute: all 4 properties/operations
Difference Between Ratio and Interval

• Is it physically meaningful to say that a temperature of 10° is twice that of 5° on
  – the Celsius scale?
  – the Fahrenheit scale?
  – the Kelvin scale?

• Consider measuring height above average
  – If Bill's height is three inches above average and Bob's height is six inches above average, then would we say that Bob is twice as tall as Bill?
  – Is this situation analogous to that of temperature?
Attribute Types: Description, Examples, Operations

• Nominal (categorical, qualitative)
  – Description: nominal attribute values only distinguish. (=, ≠)
  – Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
  – Operations: mode, entropy, contingency correlation, chi-squared test

• Ordinal (categorical, qualitative)
  – Description: ordinal attribute values also order objects. (<, >)
  – Examples: hardness of minerals, {good, better, best}, grades, street numbers
  – Operations: median, percentiles, rank correlation, run tests, sign tests

• Interval (numeric, quantitative)
  – Description: for interval attributes, differences between values are meaningful. (+, -)
  – Examples: calendar dates, temperature in Celsius or Fahrenheit
  – Operations: mean, standard deviation, Pearson's correlation, t and F tests

• Ratio (numeric, quantitative)
  – Description: for ratio variables, both differences and ratios are meaningful. (*, /)
  – Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, current
  – Operations: geometric mean, harmonic mean, percent variation

This categorization of attributes is due to S. S. Stevens.
Attribute Transformations

• Nominal (categorical, qualitative)
  – Transformation: any permutation of values
  – Comment: if all employee ID numbers were reassigned, would it make any difference?

• Ordinal (categorical, qualitative)
  – Transformation: an order-preserving change of values, i.e., new_value = f(old_value), where f is a monotonic function
  – Comment: an attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.

• Interval (numeric, quantitative)
  – Transformation: new_value = a * old_value + b, where a and b are constants
  – Comment: thus, the Fahrenheit and Celsius temperature scales differ in terms of where their zero value is and the size of a unit (degree).

• Ratio (numeric, quantitative)
  – Transformation: new_value = a * old_value
  – Comment: length can be measured in meters or feet.

This categorization of attributes is due to S. S. Stevens.
Discrete and Continuous Attributes
• Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a collection
of documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete attributes
• Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and represented
using a finite number of digits.
– Continuous attributes are typically represented as floating-point
variables.

Asymmetric Attributes

• Only presence (a non-zero attribute value) is regarded as important
  – Words present in documents
  – Items present in customer transactions

• If we met a friend in the grocery store, would we ever say the following?
  "I see our purchases are very similar, since we didn't buy most of the same things."

• We need two asymmetric binary attributes to represent one ordinary binary attribute
  – Association analysis uses asymmetric attributes

• Asymmetric attributes typically arise from objects that are sets
Some Extensions and Critiques

• Velleman, Paul F., and Leland Wilkinson. "Nominal, ordinal, interval, and ratio typologies are misleading." The American Statistician 47, no. 1 (1993): 65–72.

• Mosteller, Frederick, and John W. Tukey. Data Analysis and Regression: A Second Course in Statistics. Addison-Wesley Series in Behavioral Science: Quantitative Methods. Reading, Mass.: Addison-Wesley, 1977.

• Chrisman, Nicholas R. "Rethinking levels of measurement for cartography." Cartography and Geographic Information Systems 25, no. 4 (1998): 231–242.
Critiques

• Incomplete
– Asymmetric binary
– Cyclical
– Multivariate
– Partially ordered
– Partial membership
– Relationships between the data

• Real data is approximate and noisy


– This can complicate recognition of the proper attribute type
– Treating one attribute type as another may be approximately
correct

Critiques …

• Not a good guide for statistical analysis


– May unnecessarily restrict operations and results
• Statistical analysis is often approximate
• Thus, for example, using interval analysis for ordinal values may be
justified
– Transformations are common but don’t preserve scales
• Can transform data to a new scale with better statistical properties
• Many statistical analyses depend only on the distribution

More Complicated Examples

• ID numbers
  – Nominal, ordinal, or interval?

• Number of cylinders in an automobile engine
  – Nominal, ordinal, or ratio?

• Biased scale
  – Interval or ratio?
Key Messages for Attribute Types

• The types of operations you choose should be "meaningful" for the type of data you have
  – Distinctness, order, meaningful intervals, and meaningful ratios are only four properties of data

  – The data type you see (often numbers or strings) may not capture all the properties or may suggest properties that are not there

  – Analysis may depend on these other properties of the data
    • Many statistical analyses depend only on the distribution

  – Many times what is meaningful is measured by statistical significance

  – But in the end, what is meaningful is measured by the domain
Types of data sets

• Record
– Data Matrix
– Document Data
– Transaction Data
• Graph
– World Wide Web
– Molecular Structures
• Ordered
– Spatial Data
– Temporal Data
– Sequential Data
– Genetic Sequence Data

Important Characteristics of Data

–Dimensionality (number of attributes)


• High dimensional data brings a number of challenges

–Sparsity
• Only presence counts

–Resolution
• Patterns depend on the scale

–Size
• Type of analysis may depend on size of data
Record Data

• Data that consists of a collection of records, each of which consists of a fixed set of attributes

  Tid  Refund  Marital Status  Taxable Income  Cheat
   1   Yes     Single          125K            No
   2   No      Married         100K            No
   3   No      Single           70K            No
   4   Yes     Married         120K            No
   5   No      Divorced         95K            Yes
   6   No      Married          60K            No
   7   Yes     Divorced        220K            No
   8   No      Single           85K            Yes
   9   No      Married          75K            No
  10   No      Single           90K            Yes
Data Matrix

• If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute

• Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute

  Projection of x Load  Projection of y Load  Distance  Load  Thickness
  10.23                 5.27                  15.22     2.7   1.2
  12.65                 6.25                  16.22     2.2   1.1
Document Data

• Each document becomes a 'term' vector
  – Each term is a component (attribute) of the vector
  – The value of each component is the number of times the corresponding term occurs in the document.

              team  coach  play  ball  score  game  win  lost  timeout  season
  Document 1    3     0      5     0     2      6     0    2      0        2
  Document 2    0     7      0     2     1      0     0    3      0        0
  Document 3    0     1      0     0     1      2     2    0      3        0
Transaction Data

• A special type of record data, where


– Each record (transaction) involves a set of items.
– For example, consider a grocery store. The set of products
purchased by a customer during one shopping trip constitute a
transaction, while the individual products that were purchased are
the items.

TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk

Graph Data

• Examples: generic graph, a molecule, and webpages

[Figure: a generic graph with numbered nodes and edges, and the benzene molecule C6H6 represented as a molecular-structure graph.]
Ordered Data

• Sequences of transactions

[Figure: a sequence of transactions over time; each element of the sequence is a set of items/events.]
Ordered Data

• Genomic sequence data

GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG

Ordered Data

• Spatio-Temporal Data

[Figure: average monthly temperature of land and ocean.]
Data Quality

• Poor data quality negatively affects many data processing efforts

  "The most important point is that poor data quality is an unfolding disaster. Poor data quality costs the typical company at least ten percent (10%) of revenue; twenty percent (20%) is probably a better estimate."
  (Thomas C. Redman, DM Review, August 2004)

• Data mining example: a classification model for detecting people who are loan risks is built using poor data
  – Some credit-worthy candidates are denied loans
  – More loans are given to individuals that default
Data Quality …

• What kinds of data quality problems?


• How can we detect problems with the data?
• What can we do about these problems?

• Examples of data quality problems:


– Noise and outliers
– Missing values
– Duplicate data
– Wrong data

Noise

• For objects, noise is an extraneous object

• For attributes, noise refers to modification of original values
  – Examples: distortion of a person's voice when talking on a poor phone and "snow" on a television screen

[Figure: two sine waves, shown clean and with added noise.]
Outliers

• Outliers are data objects with characteristics that are considerably different from those of most of the other data objects in the data set
  – Case 1: Outliers are noise that interferes with data analysis
  – Case 2: Outliers are the goal of our analysis
    • Credit card fraud
    • Intrusion detection

• Causes?
Missing Values
• Reasons for missing values
– Information is not collected
(e.g., people decline to give their age and weight)
– Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)

• Handling missing values


– Eliminate data objects or variables
– Estimate missing values
• Example: time series of temperature
• Example: census results
– Ignore the missing value during analysis

Missing Values …
• Missing completely at random (MCAR)
  – Missingness of a value is independent of attributes
  – Fill in values based on the attribute
  – Analysis may be unbiased overall
• Missing at Random (MAR)
  – Missingness is related to other variables
  – Fill in values based on other values
  – Almost always produces a bias in the analysis
• Missing Not at Random (MNAR)
  – Missingness is related to unobserved measurements
  – Informative or non-ignorable missingness
• Not possible to know the situation from the data
Duplicate Data

• Data set may include data objects that are duplicates, or almost
duplicates of one another
– Major issue when merging data from heterogeneous sources

• Examples:
– Same person with multiple email addresses

• Data cleaning
– Process of dealing with duplicate data issues

• When should duplicate data not be removed?

Similarity and Dissimilarity Measures

• Similarity measure
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.
– Often falls in the range [0,1]
• Dissimilarity measure
– Numerical measure of how different two data objects are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
• Proximity refers to a similarity or dissimilarity

Similarity/Dissimilarity for Simple Attributes

The following shows the dissimilarity d and similarity s between two objects, x and y, with respect to a single, simple attribute:

• Nominal: d = 0 if x = y, d = 1 if x ≠ y; s = 1 if x = y, s = 0 if x ≠ y
• Ordinal: d = |x - y| / (n - 1), where the values are mapped to integers 0 to n - 1; s = 1 - d
• Interval or ratio: d = |x - y|; s = -d, s = 1/(1 + d), or s = 1 - (d - min_d)/(max_d - min_d)
Euclidean Distance

• Euclidean Distance

    d(x, y) = sqrt( sum_{k=1}^{n} (x_k - y_k)^2 )

  where n is the number of dimensions (attributes) and x_k and y_k are, respectively, the kth attributes (components) of data objects x and y.

• Standardization is necessary, if scales differ.
Euclidean Distance

  point   x   y
  p1      0   2
  p2      2   0
  p3      3   1
  p4      5   1

          p1      p2      p3      p4
  p1      0       2.828   3.162   5.099
  p2      2.828   0       1.414   3.162
  p3      3.162   1.414   0       2
  p4      5.099   3.162   2       0

  Distance Matrix
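The distance matrix above is easy to verify in code; a minimal Python sketch, with the point coordinates taken from the table above:

```python
from math import sqrt

def euclidean(x, y):
    # d(x, y) = sqrt(sum over k of (x_k - y_k)^2)
    return sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))

points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

# Pairwise distance matrix, rounded to 3 decimals as on the slide
matrix = {a: {b: round(euclidean(pa, pb), 3) for b, pb in points.items()}
          for a, pa in points.items()}

print(matrix["p1"]["p2"])  # 2.828
print(matrix["p1"]["p4"])  # 5.099
```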
Minkowski Distance

• Minkowski Distance is a generalization of Euclidean Distance

    d(x, y) = ( sum_{k=1}^{n} |x_k - y_k|^r )^(1/r)

  where r is a parameter, n is the number of dimensions (attributes) and x_k and y_k are, respectively, the kth attributes (components) of data objects x and y.
Minkowski Distance: Examples

• r = 1. City block (Manhattan, taxicab, L1 norm) distance.
  – A common example of this is the Hamming distance, which is just the number of bits that are different between two binary vectors

• r = 2. Euclidean distance (L2 norm)

• r → ∞. "supremum" (Lmax norm, L∞ norm) distance.
  – This is the maximum difference between any component of the vectors

• Do not confuse r with n, i.e., all these distances are defined for all numbers of dimensions.
Minkowski Distance

  point   x   y
  p1      0   2
  p2      2   0
  p3      3   1
  p4      5   1

  L1      p1      p2      p3      p4
  p1      0       4       4       6
  p2      4       0       2       4
  p3      4       2       0       2
  p4      6       4       2       0

  L2      p1      p2      p3      p4
  p1      0       2.828   3.162   5.099
  p2      2.828   0       1.414   3.162
  p3      3.162   1.414   0       2
  p4      5.099   3.162   2       0

  L∞      p1      p2      p3      p4
  p1      0       2       3       5
  p2      2       0       1       3
  p3      3       1       0       2
  p4      5       3       2       0

  Distance Matrix
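One generic function covers all three matrices above; a small Python sketch:

```python
def minkowski(x, y, r):
    # d(x, y) = (sum over k of |x_k - y_k|^r)^(1/r)
    return sum(abs(xk - yk) ** r for xk, yk in zip(x, y)) ** (1 / r)

def supremum(x, y):
    # The r -> infinity limit: the largest per-component difference
    return max(abs(xk - yk) for xk, yk in zip(x, y))

p1, p4 = (0, 2), (5, 1)
print(minkowski(p1, p4, 1))            # 6.0   (L1, city block)
print(round(minkowski(p1, p4, 2), 3))  # 5.099 (L2, Euclidean)
print(supremum(p1, p4))                # 5     (L-infinity)
```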
Mahalanobis Distance

    mahalanobis(x, y) = (x - y)^T Σ⁻¹ (x - y)

  Σ is the covariance matrix

For the red points in the figure, the Euclidean distance is 14.7, the Mahalanobis distance is 6.
Mahalanobis Distance

  Covariance matrix:

  Σ = | 0.3  0.2 |
      | 0.2  0.3 |

  A: (0.5, 0.5)
  B: (0, 1)
  C: (1.5, 1.5)

  Mahal(A, B) = 5
  Mahal(A, C) = 4
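The two values above can be checked directly; a small pure-Python sketch that inverts the 2x2 covariance matrix analytically:

```python
def mahalanobis_2d(x, y, cov):
    # mahalanobis(x, y) = (x - y)^T Sigma^-1 (x - y)
    # Invert the 2x2 covariance matrix analytically
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx, dy = x[0] - y[0], x[1] - y[1]
    # (dx, dy) . Sigma^-1 . (dx, dy)^T
    return dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)

cov = ((0.3, 0.2), (0.2, 0.3))
A, B, C = (0.5, 0.5), (0.0, 1.0), (1.5, 1.5)
print(round(mahalanobis_2d(A, B, cov), 6))  # 5.0
print(round(mahalanobis_2d(A, C, cov), 6))  # 4.0
```

Note this is the squared form used on the slide; some texts take the square root of this quantity.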
Common Properties of a Distance
• Distances, such as the Euclidean distance, have some well-known properties.
  1. d(x, y) ≥ 0 for all x and y, and d(x, y) = 0 only if x = y. (Positive definiteness)
  2. d(x, y) = d(y, x) for all x and y. (Symmetry)
  3. d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z. (Triangle Inequality)
  where d(x, y) is the distance (dissimilarity) between points (data objects) x and y.
• A distance that satisfies these properties is a metric
Common Properties of a Similarity
• Similarities also have some well-known properties.

  1. s(x, y) = 1 (or maximum similarity) only if x = y.

  2. s(x, y) = s(y, x) for all x and y. (Symmetry)

  where s(x, y) is the similarity between points (data objects) x and y.
Similarity Between Binary Vectors
• Common situation is that objects, p and q, have only binary
attributes

• Compute similarities using the following quantities


f01 = the number of attributes where p was 0 and q was 1
f10 = the number of attributes where p was 1 and q was 0
f00 = the number of attributes where p was 0 and q was 0
f11 = the number of attributes where p was 1 and q was 1

• Simple Matching and Jaccard Coefficients


SMC = number of matches / number of attributes
= (f11 + f00) / (f01 + f10 + f11 + f00)
J = number of 11 matches / number of non-zero attributes
= (f11) / (f01 + f10 + f11)

SMC versus Jaccard: Example

x= 1000000000
y= 0000001001

f01 = 2 (the number of attributes where p was 0 and q was 1)


f10 = 1 (the number of attributes where p was 1 and q was 0)
f00 = 7 (the number of attributes where p was 0 and q was 0)
f11 = 0 (the number of attributes where p was 1 and q was 1)

SMC = (f11 + f00) / (f01 + f10 + f11 + f00)


= (0+7) / (2+1+0+7) = 0.7

J = (f11) / (f01 + f10 + f11) = 0 / (2 + 1 + 0) = 0

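The counts and both coefficients from the example above can be computed mechanically; a short Python sketch:

```python
def binary_similarities(x, y):
    # Count the four combinations of attribute values across the two vectors
    f01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    f10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    f00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    smc = (f11 + f00) / (f01 + f10 + f11 + f00)  # all matches / all attributes
    jaccard = f11 / (f01 + f10 + f11) if (f01 + f10 + f11) else 0.0
    return smc, jaccard

x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(binary_similarities(x, y))  # (0.7, 0.0)
```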
Cosine Similarity

• If d1 and d2 are two document vectors, then
  cos(d1, d2) = <d1, d2> / (||d1|| ||d2||),
  where <d1, d2> indicates the inner (dot) product of vectors d1 and d2, and ||d|| is the length of vector d.
• Example:
  d1 = 3 2 0 5 0 0 0 2 0 0
  d2 = 1 0 0 0 0 0 0 1 0 2
  <d1, d2> = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
  ||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = 42^0.5 = 6.481
  ||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = 6^0.5 = 2.449
  cos(d1, d2) = 0.3150
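The worked example above, as code; a minimal Python sketch:

```python
from math import sqrt

def cosine(d1, d2):
    # cos(d1, d2) = <d1, d2> / (||d1|| * ||d2||)
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = sqrt(sum(a * a for a in d1))
    norm2 = sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(cosine(d1, d2), 4))  # 0.315
```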
Extended Jaccard Coefficient (Tanimoto)

• Variation of Jaccard for continuous or count attributes
  – Reduces to Jaccard for binary attributes

    EJ(x, y) = <x, y> / ( ||x||² + ||y||² - <x, y> )
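A small Python sketch of the coefficient, assuming the standard form EJ(x, y) = <x, y> / (||x||² + ||y||² - <x, y>); the binary example below shows the reduction to Jaccard:

```python
def tanimoto(x, y):
    # EJ(x, y) = <x, y> / (||x||^2 + ||y||^2 - <x, y>)
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sum(a * a for a in x) + sum(b * b for b in y) - dot)

# For binary vectors this reduces to the Jaccard coefficient:
x = [1, 0, 1, 1]
y = [1, 1, 0, 1]
print(tanimoto(x, y))  # f11=2, f01=1, f10=1 -> 2 / (3 + 3 - 2) = 0.5
```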
Correlation

• Correlation measures the linear relationship between objects

    corr(x, y) = covariance(x, y) / ( std(x) * std(y) )
Visually Evaluating Correlation

[Figure: scatter plots showing the similarity (correlation) ranging from -1 to 1.]
Drawback of Correlation

• x = (-3, -2, -1, 0, 1, 2, 3)
• y = (9, 4, 1, 0, 1, 4, 9)

  y_i = x_i²

• mean(x) = 0, mean(y) = 4
• std(x) = 2.16, std(y) = 3.74

• corr(x, y) = ( (-3)(5) + (-2)(0) + (-1)(-3) + (0)(-4) + (1)(-3) + (2)(0) + (3)(5) ) / ( 6 * 2.16 * 3.74 )
             = 0

So a perfect (but nonlinear) relationship can still yield zero correlation.
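The zero correlation above can be reproduced directly; a small Python sketch of the Pearson correlation (sample formula, n - 1 denominators):

```python
from math import sqrt

def pearson(x, y):
    # corr = covariance(x, y) / (std(x) * std(y)), with n-1 denominators
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    sx = sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))
    sy = sqrt(sum((b - my) ** 2 for b in y) / (n - 1))
    return cov / (sx * sy)

x = [-3, -2, -1, 0, 1, 2, 3]
y = [a * a for a in x]  # y = x^2: a perfect, but nonlinear, relationship
print(pearson(x, y))    # 0.0
```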
Comparison of Proximity Measures

• Domain of application
– Similarity measures tend to be specific to the type of attribute and
data
– Record data, images, graphs, sequences, 3D-protein structure, etc.
tend to have different measures
• However, one can talk about various properties that you would like a
proximity measure to have
– Symmetry is a common one
– Tolerance to noise and outliers is another
– Ability to find more types of patterns?
– Many others possible
• The measure must be applicable to the data and produce results that
agree with domain knowledge

Information Based Measures

• Information theory is a well-developed and fundamental discipline with broad applications

• Some similarity measures are based on information theory


– Mutual information in various versions
– Maximal Information Coefficient (MIC) and related measures
– General and can handle non-linear relationships
– Can be complicated and time intensive to compute

Information and Probability

• Information relates to possible outcomes of an event


– transmission of a message, flip of a coin, or measurement of a
piece of data

• The more certain an outcome, the less information that it contains


and vice-versa
– For example, if a coin has two heads, then an outcome of heads
provides no information
– More quantitatively, the information is related to the probability of an outcome
• The smaller the probability of an outcome, the more information it
provides and vice-versa
– Entropy is the commonly used measure

Entropy

• For
  – a variable (event), X,
  – with n possible values (outcomes), x1, x2, …, xn
  – each outcome having probability, p1, p2, …, pn
  – the entropy of X, H(X), is given by

    H(X) = - sum_{i=1}^{n} p_i log2(p_i)

• Entropy is between 0 and log2(n) and is measured in bits
  – Thus, entropy is a measure of how many bits it takes to represent an observation of X on average
Entropy Examples

• For a coin with probability p of heads and probability q = 1 - p of tails

    H = - p log2(p) - q log2(q)

  – For p = 0.5, q = 0.5 (fair coin), H = 1

  – For p = 1 or q = 1, H = 0

• What is the entropy of a fair four-sided die?
Entropy for Sample Data: Example

  Hair Color   Count   p      -p log2 p
  Black         75     0.75   0.3113
  Brown         15     0.15   0.4105
  Blond          5     0.05   0.2161
  Red            0     0.00   0
  Other          5     0.05   0.2161
  Total        100     1.00   1.1540

  Maximum entropy is log2(5) = 2.3219
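The entropy above, and the four-sided-die question from the previous slide, can be checked with a few lines of Python:

```python
from math import log2

def entropy(probs):
    # H(X) = -sum(p * log2(p)), taking 0 * log2(0) = 0
    return -sum(p * log2(p) for p in probs if p > 0)

counts = {"Black": 75, "Brown": 15, "Blond": 5, "Red": 0, "Other": 5}
total = sum(counts.values())
H = entropy(c / total for c in counts.values())
print(round(H, 4))          # 1.154
print(entropy([0.25] * 4))  # 2.0 (fair four-sided die)
```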
Entropy for Sample Data

• Suppose we have
  – a number of observations (m) of some attribute, X, e.g., the hair color of students in the class,
  – where there are n different possible values
  – and the number of observations in the ith category is m_i
  – Then, for this sample,

    H(X) = - sum_{i=1}^{n} (m_i / m) log2(m_i / m)

• For continuous data, the calculation is harder
Mutual Information

• Information one variable provides about another

  Formally, I(X, Y) = H(X) + H(Y) - H(X, Y), where

  H(X, Y) is the joint entropy of X and Y:

    H(X, Y) = - sum_i sum_j p_ij log2(p_ij)

  where p_ij is the probability that the ith value of X and the jth value of Y occur together

• For discrete variables, this is easy to compute

• Maximum mutual information for discrete variables is log2(min(n_X, n_Y)), where n_X (n_Y) is the number of values of X (Y)
Mutual Information Example

  Student Status   Count   p      -p log2 p
  Undergrad         45     0.45   0.5184
  Grad              55     0.55   0.4744
  Total            100     1.00   0.9928

  Grade   Count   p      -p log2 p
  A        35     0.35   0.5301
  B        50     0.50   0.5000
  C        15     0.15   0.4105
  Total   100     1.00   1.4406

  Student Status   Grade   Count   p      -p log2 p
  Undergrad        A         5     0.05   0.2161
  Undergrad        B        30     0.30   0.5211
  Undergrad        C        10     0.10   0.3322
  Grad             A        30     0.30   0.5211
  Grad             B        20     0.20   0.4644
  Grad             C         5     0.05   0.2161
  Total                    100     1.00   2.2710

• Mutual information of Student Status and Grade = 0.9928 + 1.4406 - 2.2710 = 0.1624
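The mutual information above can be computed from the joint counts alone; a Python sketch (the unrounded value differs slightly from 0.1624 because the slide sums already-rounded entropies):

```python
from math import log2

def entropy(probs):
    # H = -sum(p * log2(p)), taking 0 * log2(0) = 0
    return -sum(p * log2(p) for p in probs if p > 0)

# Joint counts from the table above: (status, grade) -> count
joint = {("Undergrad", "A"): 5, ("Undergrad", "B"): 30, ("Undergrad", "C"): 10,
         ("Grad", "A"): 30, ("Grad", "B"): 20, ("Grad", "C"): 5}
n = sum(joint.values())

# Marginal counts for each variable
status, grade = {}, {}
for (s, g), c in joint.items():
    status[s] = status.get(s, 0) + c
    grade[g] = grade.get(g, 0) + c

# I(X, Y) = H(X) + H(Y) - H(X, Y)
mi = (entropy(c / n for c in status.values())
      + entropy(c / n for c in grade.values())
      - entropy(c / n for c in joint.values()))
print(round(mi, 3))  # 0.162
```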
Maximal Information Coefficient
• Reshef, David N., Yakir A. Reshef, Hilary K. Finucane, Sharon R. Grossman, Gilean McVean,
Peter J. Turnbaugh, Eric S. Lander, Michael Mitzenmacher, and Pardis C. Sabeti. "Detecting novel
associations in large data sets." science 334, no. 6062 (2011): 1518-1524.
• Applies mutual information to two continuous variables
• Consider the possible binnings of the variables into discrete categories
  – n_X × n_Y ≤ N^0.6, where
    • n_X is the number of values of X
    • n_Y is the number of values of Y
    • N is the number of samples (observations, data objects)
• Compute the mutual information
  – Normalized by log2( min(n_X, n_Y) )
• Take the highest value

General Approach for Combining Similarities

• Sometimes attributes are of many different types, but an overall similarity is needed.

  1: For the kth attribute, compute a similarity, s_k(x, y), in the range [0, 1].

  2: Define an indicator variable, δ_k, for the kth attribute as follows:
     δ_k = 0 if the kth attribute is an asymmetric attribute and both objects have a value of 0, or if one of the objects has a missing value for the kth attribute
     δ_k = 1 otherwise

  3: Compute

     similarity(x, y) = sum_k δ_k s_k(x, y) / sum_k δ_k
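The three steps can be sketched in a few lines of Python; the per-attribute similarity used here (exact match on binary values) and the example attributes are illustrative assumptions, not from the slides:

```python
def combined_similarity(x, y, asymmetric):
    # Step 1: per-attribute similarity in [0, 1]; for these binary attributes,
    #         s_k = 1 if the values match and 0 otherwise (illustrative choice).
    # Step 2: indicator delta_k = 0 for an asymmetric attribute where both
    #         objects are 0 (a missing value would also give delta_k = 0).
    # Step 3: average of the contributing similarities.
    num = den = 0
    for xk, yk, asym in zip(x, y, asymmetric):
        delta = 0 if (asym and xk == 0 and yk == 0) else 1
        s = 1 if xk == yk else 0
        num += delta * s
        den += delta
    return num / den

x = [1, 0, 0, 1]
y = [1, 0, 1, 1]
# Attributes 2 and 3 are asymmetric: the shared 0 in attribute 2 is ignored
print(combined_similarity(x, y, [False, True, True, False]))  # 2/3
```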
Using Weights to Combine Similarities

• May not want to treat all attributes the same.
  – Use non-negative weights w_k:

    similarity(x, y) = sum_k w_k δ_k s_k(x, y) / sum_k w_k δ_k

• Can also define a weighted form of distance:

    d(x, y) = ( sum_{k=1}^{n} w_k |x_k - y_k|^r )^(1/r)
Density

• Measures the degree to which data objects are close to each


other in a specified area
• The notion of density is closely related to that of proximity
• Concept of density is typically used for clustering and
anomaly detection
• Examples:
– Euclidean density
• Euclidean density = number of points per unit volume
– Probability density
• Estimate what the distribution of the data looks like
– Graph-based density
• Connectivity

Euclidean Density: Grid-based Approach

• Simplest approach is to divide region into a number of


rectangular cells of equal volume and define density as # of
points the cell contains

[Figure: grid-based density, showing the point count for each cell.]
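The grid-based count is a one-liner with a Counter; a Python sketch (the cell size and points are illustrative):

```python
from collections import Counter

def grid_density(points, cell_size):
    # Map each point to its rectangular cell and count points per cell
    return Counter((int(x // cell_size), int(y // cell_size)) for x, y in points)

points = [(0.2, 0.3), (0.7, 0.1), (1.4, 0.2), (2.9, 2.8)]
cells = grid_density(points, 1.0)
print(cells[(0, 0)])  # 2
```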
Euclidean Density: Center-Based

• Euclidean density is the number of points within a specified


radius of the point

[Figure: illustration of center-based density, counting points within a specified radius.]
Thank you !!!

