Data Preprocessing II
Integration
Data Matrix & Dissimilarity Matrix
In a dissimilarity matrix, d(i, j) is the measured dissimilarity or “difference” between objects i and j. In general, d(i, j) is a nonnegative
number that is close to 0 when objects i and j are highly similar or “near” each other, and becomes larger the more they
differ.
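As a rough illustration of how a data matrix (objects by attributes) relates to a dissimilarity matrix (objects by objects), the sketch below computes d(i, j) for every pair of rows. It assumes numeric attributes and uses the Euclidean distance as the measure; the function names and the sample data are hypothetical and chosen only for this example.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dissimilarity_matrix(data, dist=euclidean):
    """Return the n-by-n matrix of pairwise dissimilarities d(i, j)."""
    n = len(data)
    return [[dist(data[i], data[j]) for j in range(n)] for i in range(n)]

# Hypothetical data matrix: 3 objects (rows), 2 numeric attributes (columns).
data = [(1.0, 2.0), (1.0, 2.0), (4.0, 6.0)]
d = dissimilarity_matrix(data)

for row in d:
    print(row)
```

Objects 0 and 1 are identical, so their dissimilarity is 0, while object 2 lies farther away and gets a larger value. The matrix is symmetric with zeros on the diagonal, which is why only the lower triangle is usually written out.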
Similarity Matrix for Nominal Attributes
A nominal attribute can take on two or more states.
map_color is a nominal attribute that may have, say, five states: red, yellow, green, pink, and blue.
Since we have a single nominal attribute here, test-1, we set p = 1 (the total number of attributes), so d(i, j) evaluates to 0 if objects i
and j match, and 1 if the objects differ.
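The matching rule above can be sketched in a few lines. The sketch assumes the usual mismatch ratio d(i, j) = (p - m) / p, where m is the number of attributes on which the two objects match; with p = 1 this reduces to the 0-or-1 behaviour described above. The test-1 values used here are hypothetical placeholders.

```python
def nominal_dissimilarity(a, b):
    """Mismatch ratio (p - m) / p over the nominal attributes of two objects."""
    p = len(a)                                   # total number of attributes
    m = sum(1 for x, y in zip(a, b) if x == y)   # number of matching attributes
    return (p - m) / p

# Hypothetical values of the single nominal attribute test-1 for four objects.
objects = [("code A",), ("code B",), ("code C",), ("code A",)]

# Lower-triangular dissimilarity matrix: row i holds d(i, 0) .. d(i, i).
matrix = [
    [nominal_dissimilarity(objects[i], objects[j]) for j in range(i + 1)]
    for i in range(len(objects))
]

for row in matrix:
    print(row)
```

Only the first and fourth objects share the same state, so d is 0 for that pair and 1 for every other pair of distinct objects.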
The dissimilarity between two objects can be calculated using the Euclidean distance.
Similarity Matrix for Ordinal Attributes
There are three states for test-2: fair, good, and excellent, that is, M_f = 3.
For step 1, if we replace each value for test-2 by its rank, the four objects are assigned the ranks 3, 1, 2, and 3,
respectively.
Step 2 normalizes the ranking by mapping rank 1 to 0.0, rank 2 to 0.5, and rank 3 to 1.0.
For step 3, we can use, say, the Euclidean distance between the normalized ranks.
The dissimilarity matrix looks as follows.
0
1.0  0
0.5  0.5  0
0    1.0  0.5  0
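The three steps for test-2 can be sketched as follows. The object values are inferred from the stated ranks 3, 1, 2, and 3, and the normalization is assumed to be z = (r - 1) / (M_f - 1), which reproduces the mapping of ranks 1, 2, and 3 to 0.0, 0.5, and 1.0. With a single attribute, the Euclidean distance is simply the absolute difference.

```python
# Ordered states of the ordinal attribute test-2, lowest to highest.
states = ["fair", "good", "excellent"]
rank_of = {state: r for r, state in enumerate(states, start=1)}

# Values of test-2 for the four objects (inferred from ranks 3, 1, 2, 3).
values = ["excellent", "fair", "good", "excellent"]

# Step 1: replace each value by its rank.
ranks = [rank_of[v] for v in values]

# Step 2: normalize ranks onto [0.0, 1.0].
m_f = len(states)
z = [(r - 1) / (m_f - 1) for r in ranks]

# Step 3: Euclidean distance on one attribute is the absolute difference.
matrix = [[abs(z[i] - z[j]) for j in range(i + 1)] for i in range(len(z))]

print(ranks)
print(z)
for row in matrix:
    print(row)
```

Objects 1 and 4 share the same state and are therefore at distance 0, while the pair fair/excellent is maximally dissimilar at 1.0.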
Similarity Matrix for Mixed Attributes
The technique combines the different attributes into a single dissimilarity matrix, bringing all of the meaningful
attributes onto a common scale of the interval [0.0, 1.0].
Suppose that the data set contains p attributes of mixed types. The dissimilarity d(i, j ) between objects i and j is
defined as,
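A minimal sketch of one common form of this definition is given below. It assumes d(i, j) is the sum of the per-attribute dissimilarities divided by the number of attributes that contribute, where an attribute contributes only if neither object has a missing value for it. Nominal attributes contribute 0 or 1, and numeric attributes contribute the absolute difference divided by the attribute's range, which is assumed to be nonzero. Ordinal attributes are assumed to be already normalized to [0.0, 1.0] and treated as numeric with range 1. All names and sample values are hypothetical.

```python
def mixed_dissimilarity(a, b, kinds, ranges):
    """Average per-attribute dissimilarity over attributes present in both objects."""
    total = 0.0
    counted = 0
    for f, kind in enumerate(kinds):
        x, y = a[f], b[f]
        if x is None or y is None:
            continue  # indicator is 0: a missing value contributes nothing
        if kind == "nominal":
            d_f = 0.0 if x == y else 1.0
        elif kind == "numeric":
            d_f = abs(x - y) / ranges[f]  # scaled onto [0.0, 1.0]
        else:
            raise ValueError(f"unknown attribute kind: {kind}")
        total += d_f
        counted += 1
    # If no attribute is comparable, the dissimilarity is undefined.
    return total / counted if counted else None

# Attributes: nominal colour, normalized ordinal grade, numeric score.
kinds = ("nominal", "numeric", "numeric")
ranges = (None, 1.0, 50.0)  # range (max - min) of each numeric attribute

a = ("red", 1.0, 10.0)
b = ("blue", 0.5, 35.0)
c = ("red", None, 10.0)  # grade is missing for this object

d_ab = mixed_dissimilarity(a, b, kinds, ranges)
d_ac = mixed_dissimilarity(a, c, kinds, ranges)
print(d_ab, d_ac)
```

Because each per-attribute term lies in [0.0, 1.0] and the sum is divided by the number of contributing attributes, the combined value also stays within [0.0, 1.0].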
Similarity Matrix for Mixed Attributes
Caution: a missing value does not always imply an error in the data. For example,
when applying for a credit card, candidates may be asked to supply their driver’s license number. Candidates who
do not have a driver’s license may naturally leave this field blank.
Tuple duplication: a common problem. Solution: enforce constraints when inserting or
updating data in the different data structures.
Data value conflict detection and resolution: Solution: concept hierarchy, normalization.
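As one possible illustration of using constraints to prevent tuple duplication, the sketch below uses Python's built-in sqlite3 module with a UNIQUE constraint. The table name, column names, and rows are hypothetical; the point is only that the database rejects a duplicate at insert time instead of leaving it to be cleaned up later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer ("
    "  customer_id INTEGER PRIMARY KEY,"
    "  email TEXT NOT NULL UNIQUE"   # constraint that blocks duplicate tuples
    ")"
)

# Hypothetical rows arriving from two sources; the third repeats an email.
rows = [(1, "a@example.com"), (2, "b@example.com"), (3, "a@example.com")]

rejected = []
for row in rows:
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO customer (customer_id, email) VALUES (?, ?)", row
            )
    except sqlite3.IntegrityError:
        rejected.append(row)  # duplicate caught by the UNIQUE constraint

count = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
print(count, rejected)
```

Which attribute should carry the uniqueness constraint depends on the data being integrated; the email column here is only a stand-in for whatever identifies a real-world entity.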
Session Outcomes
In this session you learned about:
Data Matrix and Dissimilarity Matrix in terms of similarity measures.