Data Mining CH2

This document discusses data preprocessing and summarizes key concepts about data types and attributes. It defines attributes as properties or characteristics of data objects and describes four main types of attributes: nominal, ordinal, interval, and ratio. It also outlines important properties of attribute values and how attributes can be transformed while preserving their meaning.

Uploaded by

Phantom Being

Data Mining

Chapter 2: Data Preprocessing

Basanta Joshi, PhD
Asst. Prof., Department of Electronics and Computer Engineering
Program Coordinator, MSc in Information and Communication Engineering
Member, Laboratory for ICT Research and Development (LICT)
Institute of Engineering
[email protected]
http://www.basantajoshi.com.np

Outline

• Attributes and Objects

• Types of Data

• Data Quality

• Data Pre-processing

• Various Similarity Measures

• OLAP & Multidimensional Data Analysis

What is Data?
• Collection of data objects and their attributes

• An attribute is a property or characteristic of an object
– Examples: eye color of a person, temperature, etc.
– Attribute is also known as variable, field, characteristic, dimension, or feature

• A collection of attributes describes an object
– Object is also known as record, point, case, sample, entity, or instance

Example (columns are attributes, rows are objects):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

A More Complete View of Data

• Data may have parts

• The different parts of the data may have relationships

• More generally, data may have structure

• Data can be incomplete

Attribute Values
• Attribute values are numbers or symbols assigned to an attribute for
a particular object

• Distinction between attributes and attribute values

– Same attribute can be mapped to different attribute values
• Example: height can be measured in feet or meters

– Different attributes can be mapped to the same set of values

• Example: Attribute values for ID and age are integers
• But properties of attribute values can be different

Measurement of Length

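• The way you measure an attribute may not match the attribute's properties.

[Figure: line segments A to E measured on two scales. Left scale: A = 5, B = 7, C = 8, D = 10, E = 15; this scale preserves only the ordering property of length. Right scale: A = 1, B = 2, C = 3, D = 4, E = 5; this scale preserves the ordering and additivity properties of length.]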

Types of Attributes

• There are different types of attributes

– Nominal
• Examples: ID numbers, eye color, zip codes
– Ordinal
• Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height {tall, medium, short}
– Interval
• Examples: calendar dates, temperatures in Celsius or Fahrenheit.
– Ratio
• Examples: temperature in Kelvin, length, time, counts

Properties of Attribute Values

• The type of an attribute depends on which of the following properties/operations it possesses:
– Distinctness: = ≠
– Order: < >
– Differences are meaningful: + -
– Ratios are meaningful: * /

– Nominal attribute: distinctness
– Ordinal attribute: distinctness & order
– Interval attribute: distinctness, order & meaningful differences
– Ratio attribute: all 4 properties/operations

Difference Between Ratio and Interval

• Is it physically meaningful to say that a temperature of 10° is twice that of 5° on
– the Celsius scale?
– the Fahrenheit scale?
– the Kelvin scale?

• Consider measuring the height above average

– If Bill’s height is three inches above average and Bob’s height is
six inches above average, then would we say that Bob is twice as
tall as Bill?
– Is this situation analogous to that of temperature?

Attribute Type: Description, Examples, Operations

Categorical (Qualitative)
– Nominal
• Description: Nominal attribute values only distinguish. (=, ≠)
• Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
• Operations: mode, entropy, contingency correlation, χ² test
– Ordinal
• Description: Ordinal attribute values also order objects. (<, >)
• Examples: hardness of minerals, {good, better, best}, grades, street numbers
• Operations: median, percentiles, rank correlation, run tests, sign tests

Numeric (Quantitative)
– Interval
• Description: For interval attributes, differences between values are meaningful. (+, -)
• Examples: calendar dates, temperature in Celsius or Fahrenheit
• Operations: mean, standard deviation, Pearson's correlation, t and F tests
– Ratio
• Description: For ratio variables, both differences and ratios are meaningful. (*, /)
• Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, current
• Operations: geometric mean, harmonic mean, percent variation

This categorization of attributes is due to S. S. Stevens

Attribute Type: Transformation, Comments

Categorical (Qualitative)
– Nominal
• Transformation: Any permutation of values
• Comments: If all employee ID numbers were reassigned, would it make any difference?
– Ordinal
• Transformation: An order preserving change of values, i.e., new_value = f(old_value) where f is a monotonic function
• Comments: An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.

Numeric (Quantitative)
– Interval
• Transformation: new_value = a * old_value + b where a and b are constants
• Comments: Thus, the Fahrenheit and Celsius temperature scales differ in terms of where their zero value is and the size of a unit (degree).
– Ratio
• Transformation: new_value = a * old_value
• Comments: Length can be measured in meters or feet.

This categorization of attributes is due to S. S. Stevens
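The difference between the interval and ratio transformations can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, and the feet-per-meter factor is an approximate constant chosen for the example. It shows that a transformation of the form a * x + b keeps ratios of differences but not ratios of values, while a transformation of the form a * x keeps both.

```python
def celsius_to_fahrenheit(c):
    # Interval-scale transformation: new_value = a * old_value + b
    return 9.0 / 5.0 * c + 32.0


def meters_to_feet(m):
    # Ratio-scale transformation: new_value = a * old_value
    # (3.28084 is an approximate conversion factor, assumed for illustration)
    return 3.28084 * m


# Ratio of two temperatures: 10 vs 5 degrees Celsius.
ratio_celsius = 10.0 / 5.0
ratio_fahrenheit = celsius_to_fahrenheit(10.0) / celsius_to_fahrenheit(5.0)

# Ratio of two temperature differences: (20 - 10) vs (10 - 5) degrees Celsius.
diff_ratio_celsius = (20.0 - 10.0) / (10.0 - 5.0)
diff_ratio_fahrenheit = (
    (celsius_to_fahrenheit(20.0) - celsius_to_fahrenheit(10.0))
    / (celsius_to_fahrenheit(10.0) - celsius_to_fahrenheit(5.0))
)

# Ratio of two lengths: 4 m vs 2 m.
ratio_meters = 4.0 / 2.0
ratio_feet = meters_to_feet(4.0) / meters_to_feet(2.0)

print(ratio_celsius, ratio_fahrenheit)            # value ratios differ
print(diff_ratio_celsius, diff_ratio_fahrenheit)  # difference ratios agree
print(ratio_meters, ratio_feet)                   # value ratios agree
```

The value ratio changes from 2 to 50/41 when moving from Celsius to Fahrenheit, which is why "twice as hot" is not meaningful on an interval scale.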

Discrete and Continuous Attributes
• Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a collection
of documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete attributes
• Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and represented
using a finite number of digits.
– Continuous attributes are typically represented as floating-point
variables.

Asymmetric Attributes

• Only presence (a non-zero attribute value) is regarded as important

• Words present in documents
• Items present in customer transactions

• If we met a friend in the grocery store would we ever say the following?
“I see our purchases are very similar since we didn’t buy most of the same things.”

• We need two asymmetric binary attributes to represent one ordinary binary attribute
– Association analysis uses asymmetric attributes

• Asymmetric attributes typically arise from objects that are sets

Some Extensions and Critiques

• Velleman, Paul F., and Leland Wilkinson. "Nominal, ordinal, interval, and ratio typologies are misleading." The American Statistician 47, no. 1 (1993): 65-72.

• Mosteller, Frederick, and John W. Tukey. "Data analysis and regression. A second course in statistics." Addison-Wesley Series in Behavioral Science: Quantitative Methods, Reading, Mass.: Addison-Wesley, 1977.

• Chrisman, Nicholas R. "Rethinking levels of measurement for cartography." Cartography and Geographic Information Systems 25, no. 4 (1998): 231-242.

Critiques

• Incomplete
– Asymmetric binary
– Cyclical
– Multivariate
– Partially ordered
– Partial membership
– Relationships between the data

• Real data is approximate and noisy

– This can complicate recognition of the proper attribute type
– Treating one attribute type as another may be approximately
correct

Critiques …

• Not a good guide for statistical analysis

– May unnecessarily restrict operations and results
• Statistical analysis is often approximate
• Thus, for example, using interval analysis for ordinal values may be
justified
– Transformations are common but don’t preserve scales
• Can transform data to a new scale with better statistical properties
• Many statistical analyses depend only on the distribution

More Complicated Examples

• ID numbers
– Nominal, ordinal, or interval?

• Number of cylinders in an automobile engine

– Nominal, ordinal, or ratio?

• Biased Scale
– Interval or Ratio

Key Messages for Attribute Types

• The types of operations you choose should be “meaningful” for the type of
data you have
– Distinctness, order, meaningful intervals, and meaningful ratios are only
four properties of data

– The data type you see – often numbers or strings – may not capture all the
properties or may suggest properties that are not there

– Analysis may depend on these other properties of the data

• Many statistical analyses depend only on the distribution

– Many times what is meaningful is measured by statistical significance

– But in the end, what is meaningful is measured by the domain

Types of data sets

• Record
– Data Matrix
– Document Data
– Transaction Data
• Graph
– World Wide Web
– Molecular Structures
• Ordered
– Spatial Data
– Temporal Data
– Sequential Data
– Genetic Sequence Data

Important Characteristics of Data

– Dimensionality (number of attributes)
• High dimensional data brings a number of challenges

– Sparsity
• Only presence counts

– Resolution
• Patterns depend on the scale

– Size
• Type of analysis may depend on size of data

Record Data

• Data that consists of a collection of records, each of which consists of a fixed set of attributes

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Data Matrix

• If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute

• Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute

Projection of x Load  Projection of y Load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1

Document Data

• Each document becomes a ‘term’ vector
– Each term is a component (attribute) of the vector
– The value of each component is the number of times the corresponding term occurs in the document.

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1  3     0      5     0     2      6     0    2     0        2
Document 2  0     7      0     2     1      0     0    3     0        0
Document 3  0     1      0     0     1      2     2    0     3        0
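A term vector of this kind can be built by counting word occurrences against a fixed vocabulary. The following is a minimal sketch, not the method used to produce the slide's table: the function name `term_vector`, the five-term vocabulary, and the sample text are all made up for illustration, and tokenization is a plain whitespace split.

```python
from collections import Counter


def term_vector(text, vocabulary):
    """Return the number of times each vocabulary term occurs in text."""
    counts = Counter(text.lower().split())
    # Counter returns 0 for terms that never occur, so absent terms become 0.
    return [counts[term] for term in vocabulary]


# Hypothetical vocabulary and document, chosen only for this example.
vocabulary = ["team", "coach", "play", "ball", "score"]
document = "team play team score play play team play play score"

vector = term_vector(document, vocabulary)
print(dict(zip(vocabulary, vector)))
```

Most components of such vectors are zero for realistic vocabularies, which is why document data is usually treated as sparse.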

Transaction Data

• A special type of record data, where

– Each record (transaction) involves a set of items.
– For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.

TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk

Graph Data

• Examples: Generic graph, a molecule, and webpages

[Figures: a generic graph with numbered edges; the benzene molecule, C6H6; linked web pages]

Ordered Data

• Sequences of transactions

[Figure: a sequence of transactions. Each element of the sequence is a set of items/events.]

Ordered Data

• Genomic sequence data

GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG

Ordered Data

• Spatio-Temporal Data

[Figure: average monthly temperature of land and ocean]

Data Quality

• Poor data quality negatively affects many data processing efforts

“The most important point is that poor data quality is an unfolding disaster. Poor data quality costs the typical company at least ten percent (10%) of revenue; twenty percent (20%) is probably a better estimate.”
Thomas C. Redman, DM Review, August 2004

• Data mining example: a classification model for detecting people who are loan risks is built using poor data
– Some credit-worthy candidates are denied loans
– More loans are given to individuals who default

Data Quality …

• What kinds of data quality problems?

• How can we detect problems with the data?
• What can we do about these problems?

• Examples of data quality problems:

– Noise and outliers
– Missing values
– Duplicate data
– Wrong data

Noise

• For objects, noise is an extraneous object

• For attributes, noise refers to modification of original
values
– Examples: distortion of a person’s voice when talking on a poor
phone and “snow” on television screen

[Figures: two sine waves; the same two sine waves with noise added]

Outliers

• Outliers are data objects with characteristics that are considerably different than most of the other data objects in the data set
– Case 1: Outliers are noise that interferes with data analysis
– Case 2: Outliers are the goal of our analysis
• Credit card fraud
• Intrusion detection

• Causes?

Missing Values
• Reasons for missing values
– Information is not collected
(e.g., people decline to give their age and weight)
– Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)

• Handling missing values

– Eliminate data objects or variables
– Estimate missing values
• Example: time series of temperature
• Example: census results
– Ignore the missing value during analysis
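The three handling strategies can be illustrated on a short temperature time series. This is only a sketch under stated assumptions: missing values are encoded as `None`, estimation is done by linear interpolation between the nearest known neighbors (one of many possible estimators), and the function names and readings are hypothetical.

```python
def drop_missing(values):
    """Eliminate data objects whose value is missing."""
    return [v for v in values if v is not None]


def interpolate_missing(values):
    """Estimate interior missing values by linear interpolation.

    Gaps before the first or after the last known value are left as None.
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


def mean_ignoring_missing(values):
    """Ignore missing values during analysis (here: when averaging)."""
    present = drop_missing(values)
    return sum(present) / len(present)


# Hypothetical daily temperature readings with gaps.
temperatures = [20.0, None, 24.0, 25.0, None, None, 19.0]

print(drop_missing(temperatures))
print(interpolate_missing(temperatures))
print(mean_ignoring_missing(temperatures))
```

Interpolation suits a time series because neighboring readings are related; for unordered records a different estimator would be needed.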

Missing Values …
• Missing completely at random (MCAR)
– Missingness of a value is independent of attributes
– Fill in values based on the attribute
– Analysis may be unbiased overall
• Missing at Random (MAR)
– Missingness is related to other variables
– Fill in values based on other values
– Almost always produces a bias in the analysis
• Missing Not at Random (MNAR)
– Missingness is related to unobserved measurements
– Informative or non-ignorable missingness
• Not possible to know the situation from the data

Duplicate Data

• Data set may include data objects that are duplicates, or almost
duplicates of one another
– Major issue when merging data from heterogeneous sources

• Examples:
– Same person with multiple email addresses

• Data cleaning
– Process of dealing with duplicate data issues

• When should duplicate data not be removed?

Similarity and Dissimilarity Measures

• Similarity measure
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.
– Often falls in the range [0,1]
• Dissimilarity measure
– Numerical measure of how different two data objects are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
• Proximity refers to a similarity or dissimilarity

Similarity/Dissimilarity for Simple Attributes

The following table shows the similarity and dissimilarity between two objects, x and y, with respect to a single, simple attribute.

Euclidean Distance

• Euclidean Distance

$$d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}$$

where n is the number of dimensions (attributes) and x_k and y_k are, respectively, the kth attributes (components) of data objects x and y.

• Standardization is necessary if scales differ.

Euclidean Distance

[Figure: points p1 to p4 plotted in the x-y plane]

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

Distance Matrix

      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0

Minkowski Distance

• Minkowski Distance is a generalization of Euclidean Distance

$$d(\mathbf{x}, \mathbf{y}) = \left( \sum_{k=1}^{n} |x_k - y_k|^{r} \right)^{1/r}$$

where r is a parameter, n is the number of dimensions (attributes) and x_k and y_k are, respectively, the kth attributes (components) of data objects x and y.

Minkowski Distance: Examples

• r = 1. City block (Manhattan, taxicab, L1 norm) distance.
– A common example of this is the Hamming distance, which is just the number of bits that are different between two binary vectors

• r = 2. Euclidean distance

• r → ∞. “supremum” (L_max norm, L_∞ norm) distance.
– This is the maximum difference between any component of the vectors

• Do not confuse r with n, i.e., all these distances are defined for all numbers of dimensions.

Minkowski Distance

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

L1    p1  p2  p3  p4
p1    0   4   4   6
p2    4   0   2   4
p3    4   2   0   2
p4    6   4   2   0

L2    p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0

L∞    p1  p2  p3  p4
p1    0   2   3   5
p2    2   0   1   3
p3    3   1   0   2
p4    5   3   2   0

Distance Matrix
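The three distance matrices can be reproduced with a few lines of code. The sketch below is one straightforward way to do it; the function names `minkowski` and `distance_matrix` are chosen for this example, and the supremum distance is handled as a special case by passing `math.inf` for r.

```python
import math


def minkowski(x, y, r):
    """Minkowski distance between two equal-length points for parameter r."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(r):
        # Limit r -> infinity: the largest component-wise difference.
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1.0 / r)


def distance_matrix(points, r):
    """Pairwise distances, rounded to three decimals for display."""
    names = list(points)
    return [
        [round(minkowski(points[a], points[b], r), 3) for b in names]
        for a in names
    ]


# The four points from the slide.
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

l1 = distance_matrix(points, 1)
l2 = distance_matrix(points, 2)
l_inf = distance_matrix(points, math.inf)

for name, matrix in (("L1", l1), ("L2", l2), ("Linf", l_inf)):
    print(name)
    for row in matrix:
        print(row)
```

Note that r selects the norm, while the number of dimensions n is fixed by the points themselves (n = 2 here).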

Mahalanobis Distance

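$$\mathrm{mahalanobis}(\mathbf{x}, \mathbf{y}) = (\mathbf{x} - \mathbf{y})^{T} \, \Sigma^{-1} \, (\mathbf{x} - \mathbf{y})$$

Σ is the covariance matrix

For the red points in the figure, the Euclidean distance is 14.7 and the Mahalanobis distance is 6.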

Mahalanobis Distance
Covariance Matrix:

$$\Sigma = \begin{bmatrix} 0.3 & 0.2 \\ 0.2 & 0.3 \end{bmatrix}$$

A: (0.5, 0.5)
B: (0, 1)
C: (1.5, 1.5)

Mahal(A,B) = 5
Mahal(A,C) = 4
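The two values can be verified by hand or with a short script. The following sketch handles only the two-dimensional case, inverting the 2 x 2 covariance matrix in closed form; the function name `mahalanobis_2d` is made up for this example. It computes the quadratic form (x - y)^T Σ^{-1} (x - y) without taking a square root, which is the convention these slides use.

```python
def mahalanobis_2d(x, y, cov):
    """Quadratic form (x - y)^T cov^{-1} (x - y) for 2-D points."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Closed-form inverse of a 2 x 2 matrix.
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - y[0], x[1] - y[1]]
    # tmp = inv * dx, then result = dx . tmp
    tmp = [
        inv[0][0] * dx[0] + inv[0][1] * dx[1],
        inv[1][0] * dx[0] + inv[1][1] * dx[1],
    ]
    return dx[0] * tmp[0] + dx[1] * tmp[1]


covariance = [[0.3, 0.2], [0.2, 0.3]]
A, B, C = (0.5, 0.5), (0.0, 1.0), (1.5, 1.5)

mahal_ab = mahalanobis_2d(A, B, covariance)
mahal_ac = mahalanobis_2d(A, C, covariance)
print(round(mahal_ab, 6), round(mahal_ac, 6))
```

C is farther from A than B in Euclidean terms, yet it gets the smaller Mahalanobis value because the offset A to C lies along the direction in which the two attributes vary together.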

Common Properties of a Distance
• Distances, such as the Euclidean distance, have some well known properties.
1. d(x, y) ≥ 0 for all x and y and d(x, y) = 0 only if x = y. (Positive definiteness)
2. d(x, y) = d(y, x) for all x and y. (Symmetry)
3. d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z. (Triangle Inequality)
where d(x, y) is the distance (dissimilarity) between points (data objects), x and y.
• A distance that satisfies these properties is a metric

Common Properties of a Similarity
• Similarities also have some well known properties.

1. s(x, y) = 1 (or maximum similarity) only if x = y.

2. s(x, y) = s(y, x) for all x and y. (Symmetry)

where s(x, y) is the similarity between points (data objects), x and y.

Similarity Between Binary Vectors
• Common situation is that objects, p and q, have only binary
attributes

• Compute similarities using the following quantities

f01 = the number of attributes where p was 0 and q was 1
f10 = the number of attributes where p was 1 and q was 0
f00 = the number of attributes where p was 0 and q was 0
f11 = the number of attributes where p was 1 and q was 1

• Simple Matching and Jaccard Coefficients

SMC = number of matches / number of attributes
= (f11 + f00) / (f01 + f10 + f11 + f00)
J = number of 11 matches / number of non-zero attributes
= (f11) / (f01 + f10 + f11)

SMC versus Jaccard: Example

p = 1000000000
q = 0000001001

f01 = 2 (the number of attributes where p was 0 and q was 1)
f10 = 1 (the number of attributes where p was 1 and q was 0)
f00 = 7 (the number of attributes where p was 0 and q was 0)
f11 = 0 (the number of attributes where p was 1 and q was 1)

SMC = (f11 + f00) / (f01 + f10 + f11 + f00)
= (0 + 7) / (2 + 1 + 0 + 7) = 0.7

J = (f11) / (f01 + f10 + f11) = 0 / (2 + 1 + 0) = 0
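The same calculation can be written as a small function. This is a sketch with a hypothetical name, `binary_similarity`; returning 0 for the Jaccard coefficient when both vectors are all zeros is an assumption made here to avoid division by zero, not something the slides specify.

```python
def binary_similarity(p, q):
    """Return (SMC, Jaccard) for two equal-length 0/1 vectors."""
    f11 = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)
    f00 = sum(1 for a, b in zip(p, q) if a == 0 and b == 0)
    f10 = sum(1 for a, b in zip(p, q) if a == 1 and b == 0)
    f01 = sum(1 for a, b in zip(p, q) if a == 0 and b == 1)

    smc = (f11 + f00) / (f01 + f10 + f11 + f00)
    nonzero = f01 + f10 + f11
    # Assumption: define Jaccard as 0 when neither vector has a 1.
    jaccard = f11 / nonzero if nonzero else 0.0
    return smc, jaccard


# The two vectors from the example above.
p = [int(bit) for bit in "1000000000"]
q = [int(bit) for bit in "0000001001"]

smc, jaccard = binary_similarity(p, q)
print(smc, jaccard)
```

The large gap between the two values comes entirely from the seven 0-0 matches, which SMC counts and Jaccard ignores.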

Cosine Similarity

• If d1 and d2 are two document vectors, then
cos(d1, d2) = <d1, d2> / (||d1|| ||d2||),
where <d1, d2> indicates inner product or vector dot product of vectors, d1 and d2, and ||d|| is the length of vector d.
• Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
<d1, d2> = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449
cos(d1, d2) = 5 / (6.481 * 2.449) = 0.3150
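The worked example translates directly into code. The function name `cosine_similarity` below is chosen for this sketch, and it assumes neither vector is all zeros, since the cosine is undefined in that case.

```python
import math


def cosine_similarity(d1, d2):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)


# The two document vectors from the example above.
d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]

cos_d1_d2 = cosine_similarity(d1, d2)
print(round(cos_d1_d2, 4))
```

Because both vectors are divided by their lengths, scaling a document vector (for example, doubling every count) leaves the cosine unchanged.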

Extended Jaccard Coefficient (Tanimoto)

• Variation of Jaccard for continuous or count attributes

– Reduces to Jaccard for binary attributes
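
For reference, the extended Jaccard (Tanimoto) coefficient is commonly written as below. The symbols x and y for the two vectors and the name EJ are notational assumptions made here, since the slide text itself gives no formula.

```latex
EJ(\mathbf{x}, \mathbf{y}) =
  \frac{\mathbf{x} \cdot \mathbf{y}}
       {\lVert \mathbf{x} \rVert^{2} + \lVert \mathbf{y} \rVert^{2} - \mathbf{x} \cdot \mathbf{y}}
```

For binary vectors, x · y counts the 1-1 matches (f11) and each squared norm counts the 1s in that vector, so the denominator becomes f11 + f10 + f01 and the expression reduces to the Jaccard coefficient.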

Correlation measures the linear relationship
between objects

Visually Evaluating Correlation

[Figure: scatter plots showing the similarity from –1 to 1.]

Drawback of Correlation

• x = (-3, -2, -1, 0, 1, 2, 3)

• y = (9, 4, 1, 0, 1, 4, 9)

y_i = x_i^2

• mean(x) = 0, mean(y) = 4
• std(x) = 2.16, std(y) = 3.74

• corr = [(-3)(5) + (-2)(0) + (-1)(-3) + (0)(-4) + (1)(-3) + (2)(0) + (3)(5)] / (6 * 2.16 * 3.74) = 0
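The zero correlation can be confirmed numerically. The sketch below uses a hand-written Pearson correlation with the sample (n - 1) normalization, matching the standard deviations quoted above; the function name `pearson` is chosen for this example.

```python
import math


def pearson(x, y):
    """Pearson correlation using sample (n - 1) covariance and std."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    return cov / (std_x * std_y)


x = [-3, -2, -1, 0, 1, 2, 3]
y = [v * v for v in x]  # y is fully determined by x, but not linearly

corr = pearson(x, y)
print(corr)
```

Although y is an exact function of x, the correlation is 0 because the relationship is symmetric around x = 0 rather than linear.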

Comparison of Proximity Measures

• Domain of application
– Similarity measures tend to be specific to the type of attribute and
data
– Record data, images, graphs, sequences, 3D-protein structure, etc.
tend to have different measures
• However, one can talk about various properties that you would like a
proximity measure to have
– Symmetry is a common one
– Tolerance to noise and outliers is another
– Ability to find more types of patterns?
– Many others possible
• The measure must be applicable to the data and produce results that
agree with domain knowledge

Information Based Measures

• Information theory is a well-developed and fundamental discipline with broad applications

• Some similarity measures are based on information theory

– Mutual information in various versions
– Maximal Information Coefficient (MIC) and related measures
– General and can handle non-linear relationships
– Can be complicated and time intensive to compute

Information and Probability

• Information relates to possible outcomes of an event

– transmission of a message, flip of a coin, or measurement of a
piece of data

• The more certain an outcome, the less information that it contains

and vice-versa
– For example, if a coin has two heads, then an outcome of heads
provides no information
– More quantitatively, the information is related to the probability of an outcome
• The smaller the probability of an outcome, the more information it
provides and vice-versa
– Entropy is the commonly used measure

Entropy

• For
– a variable (event), X,
– with n possible values (outcomes), x_1, x_2, …, x_n
– each outcome having probability, p_1, p_2, …, p_n
– the entropy of X, H(X), is given by

$$H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i$$

• Entropy is between 0 and log_2 n and is measured in bits
– Thus, entropy is a measure of how many bits it takes to represent an observation of X on average

Entropy Examples

• For a coin with probability p of heads and probability q = 1 – p of

tails

– For p= 0.5, q = 0.5 (fair coin) H = 1

– For p = 1 or q = 1, H = 0

• What is the entropy of a fair four-sided die?

Entropy for Sample Data: Example

Hair Color  Count  p     -p log2 p
Black       75     0.75  0.3113
Brown       15     0.15  0.4105
Blond       5      0.05  0.2161
Red         0      0.00  0
Other       5      0.05  0.2161
Total       100    1.0   1.1540

Maximum entropy is log2 5 = 2.3219
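The table can be recomputed from the raw counts. The following sketch uses a hypothetical helper, `entropy_from_counts`, and treats a zero count as contributing 0 to the sum (the usual convention that 0 · log 0 = 0).

```python
import math


def entropy_from_counts(counts):
    """Entropy in bits of the empirical distribution given by counts."""
    total = sum(counts)
    h = 0.0
    for count in counts:
        if count > 0:  # zero counts contribute nothing
            p = count / total
            h -= p * math.log2(p)
    return h


hair_color_counts = {"Black": 75, "Brown": 15, "Blond": 5, "Red": 0, "Other": 5}

hair_entropy = entropy_from_counts(list(hair_color_counts.values()))
max_entropy = math.log2(len(hair_color_counts))

print(round(hair_entropy, 4), round(max_entropy, 4))
```

The observed entropy is well below the maximum because one category (Black) dominates the sample.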

Entropy for Sample Data

• Suppose we have
– a number of observations (m) of some attribute, X, e.g., the hair color of students in the class,
– where there are n different possible values
– and the number of observations in the ith category is m_i
– Then, for this sample

$$H(X) = -\sum_{i=1}^{n} \frac{m_i}{m} \log_2 \frac{m_i}{m}$$

• For continuous data, the calculation is harder

Mutual Information

• Information one variable provides about another

Formally, $I(X, Y) = H(X) + H(Y) - H(X, Y)$, where H(X, Y) is the joint entropy of X and Y,

$$H(X, Y) = -\sum_{i} \sum_{j} p_{ij} \log_2 p_{ij}$$

where p_ij is the probability that the ith value of X and the jth value of Y occur together

• For discrete variables, this is easy to compute

• Maximum mutual information for discrete variables is log_2(min(n_X, n_Y)), where n_X (n_Y) is the number of values of X (Y)

Mutual Information Example
Student Status  Count  p     -p log2 p
Undergrad       45     0.45  0.5184
Grad            55     0.55  0.4744
Total           100    1.00  0.9928

Grade  Count  p     -p log2 p
A      35     0.35  0.5301
B      50     0.50  0.5000
C      15     0.15  0.4105
Total  100    1.00  1.4406

Student Status  Grade  Count  p     -p log2 p
Undergrad       A      5      0.05  0.2161
Undergrad       B      30     0.30  0.5211
Undergrad       C      10     0.10  0.3322
Grad            A      30     0.30  0.5211
Grad            B      20     0.20  0.4644
Grad            C      5      0.05  0.2161
Total                  100    1.00  2.2710

Mutual information of Student Status and Grade = 0.9928 + 1.4406 - 2.2710 = 0.1624

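The example can be recomputed from the joint counts alone, since the marginal counts follow by summation. This is a minimal sketch with hypothetical helper names; it computes the mutual information as the sum of the two marginal entropies minus the joint entropy, as in the example.

```python
import math
from collections import Counter


def entropy_from_counts(counts):
    """Entropy in bits of the empirical distribution given by counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# Joint counts of (student status, grade) from the example.
joint_counts = {
    ("Undergrad", "A"): 5, ("Undergrad", "B"): 30, ("Undergrad", "C"): 10,
    ("Grad", "A"): 30, ("Grad", "B"): 20, ("Grad", "C"): 5,
}

# Marginal counts are obtained by summing the joint counts.
status_counts = Counter()
grade_counts = Counter()
for (status, grade), count in joint_counts.items():
    status_counts[status] += count
    grade_counts[grade] += count

h_status = entropy_from_counts(list(status_counts.values()))
h_grade = entropy_from_counts(list(grade_counts.values()))
h_joint = entropy_from_counts(list(joint_counts.values()))
mutual_information = h_status + h_grade - h_joint

print(round(h_status, 4), round(h_grade, 4), round(h_joint, 4))
print(round(mutual_information, 3))
```

The result agrees with the example's 0.1624 to about three decimal places; the last digit can differ because the example adds entropies that were already rounded to four decimals.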
Maximal Information Coefficient
• Reshef, David N., Yakir A. Reshef, Hilary K. Finucane, Sharon R. Grossman, Gilean McVean, Peter J. Turnbaugh, Eric S. Lander, Michael Mitzenmacher, and Pardis C. Sabeti. "Detecting novel associations in large data sets." Science 334, no. 6062 (2011): 1518-1524.
• Applies mutual information to two continuous variables
• Consider the possible binnings of the variables into discrete categories
– n_X × n_Y ≤ N^0.6 where
• n_X is the number of values of X
• n_Y is the number of values of Y
• N is the number of samples (observations, data objects)
• Compute the mutual information
– Normalized by log_2(min(n_X, n_Y))
• Take the highest value

General Approach for Combining Similarities

• Sometimes attributes are of many different types, but an overall similarity is needed.
1. For the kth attribute, compute a similarity, s_k(x, y), in the range [0, 1].
2. Define an indicator variable, δ_k, for the kth attribute as follows:
δ_k = 0 if the kth attribute is an asymmetric attribute and both objects have a value of 0, or if one of the objects has a missing value for the kth attribute
δ_k = 1 otherwise
3. Compute

$$\mathrm{similarity}(\mathbf{x}, \mathbf{y}) = \frac{\sum_{k=1}^{n} \delta_k \, s_k(\mathbf{x}, \mathbf{y})}{\sum_{k=1}^{n} \delta_k}$$
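The three steps can be sketched as a small function that averages the per-attribute similarities over the attributes whose indicator is 1. Everything concrete below is an assumption made for illustration: the function and helper names, the use of `None` to mark a missing value, the choice of per-attribute similarity functions, the value range of 200 for the numeric attribute, and the two sample objects. Returning 0 when every attribute is excluded is also an assumption.

```python
def nominal_similarity(a, b):
    """1 if the two values match, 0 otherwise."""
    return 1.0 if a == b else 0.0


def make_interval_similarity(value_range):
    """Similarity 1 - |a - b| / range, which lies in [0, 1]."""
    return lambda a, b: 1.0 - abs(a - b) / value_range


def combined_similarity(x, y, similarity_functions, asymmetric):
    numerator = 0.0
    denominator = 0
    for xk, yk, sk, is_asymmetric in zip(x, y, similarity_functions, asymmetric):
        if xk is None or yk is None:
            continue  # indicator is 0: missing value
        if is_asymmetric and xk == 0 and yk == 0:
            continue  # indicator is 0: asymmetric attribute with a 0-0 match
        numerator += sk(xk, yk)  # indicator is 1
        denominator += 1
    # Assumption: similarity is 0 when no attribute can be compared.
    return numerator / denominator if denominator else 0.0


# Hypothetical objects: (marital status, bought item?, income in K, household size)
x = ("Single", 0, 125.0, None)
y = ("Single", 0, 100.0, 3)
similarity_functions = [
    nominal_similarity,
    nominal_similarity,
    make_interval_similarity(200.0),
    nominal_similarity,
]
asymmetric = [False, True, False, False]

overall = combined_similarity(x, y, similarity_functions, asymmetric)
print(overall)
```

In this example only the first and third attributes count, giving (1 + 0.875) / 2.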

Using Weights to Combine Similarities

• May not want to treat all attributes the same.
– Use non-negative weights ω_k

• Can also define a weighted form of distance

Density

• Measures the degree to which data objects are close to each other in a specified area
• The notion of density is closely related to that of proximity
• Concept of density is typically used for clustering and
anomaly detection
• Examples:
– Euclidean density
• Euclidean density = number of points per unit volume
– Probability density
• Estimate what the distribution of the data looks like
– Graph-based density
• Connectivity

Euclidean Density: Grid-based Approach

• Simplest approach is to divide region into a number of rectangular cells of equal volume and define density as # of points the cell contains

[Figure: grid-based density. Counts for each cell.]

Euclidean Density: Center-Based

• Euclidean density is the number of points within a specified radius of the point

[Figure: illustration of center-based density.]
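Both notions of Euclidean density, grid-based and center-based, amount to counting points. The sketch below is illustrative: the function names, the sample points, the cell size, and the radius are all made up, and the center-based count includes the center point itself when it is part of the data set (conventions differ on this).

```python
import math
from collections import Counter


def grid_density(points, cell_size):
    """Count points per square grid cell, keyed by integer cell indices."""
    return Counter(
        (int(x // cell_size), int(y // cell_size)) for x, y in points
    )


def center_based_density(points, center, radius):
    """Number of points within the given radius of center (inclusive)."""
    return sum(1 for p in points if math.dist(p, center) <= radius)


# Hypothetical two-dimensional data set.
points = [(0, 0), (1, 0), (0, 1), (2, 2), (5, 5)]

cells = grid_density(points, cell_size=2)
density_at_origin = center_based_density(points, center=(0, 0), radius=1.5)

print(dict(cells))
print(density_at_origin)
```

The grid-based result depends on where cell boundaries fall, while the center-based result depends on the chosen radius, so both parameters need to suit the scale of the data.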

Thank you !!!
