Data Preprocessing II

This document discusses data pre-processing techniques, including data cleaning and integration. It describes how to construct data matrices and dissimilarity matrices to measure similarity between objects. Various techniques for handling missing values and noise are presented, such as binning, regression, and outlier analysis. Common data integration challenges like entity identification, redundancy analysis, and data conflicts are outlined. Solutions involve understanding metadata, normalization, and maintaining constraints to prevent duplication. The overall goal is to prepare raw data for effective analysis.

Uploaded by

Dhruvi Thakrar
Copyright
© All Rights Reserved

Data Pre-processing II – Data Cleaning & Integration

Symbiosis International (Deemed University)


Session Objectives
By the end of this session, you will be able to:

 Understand the Data Matrix and Dissimilarity Matrix in terms of similarity measures for various attribute types and documents.

 Figure out the process of Data Cleaning.

 Understand Data Integration problems and solution approaches.
Data Matrix & Dissimilarity Matrix
Suppose that we have n objects (e.g., persons, items, or courses) described by p attributes (also called measurements or features, such as age, height, weight, or gender). The objects are x1 = (x11, x12, ..., x1p), x2 = (x21, x22, ..., x2p), and so on, where xij is the value of the jth attribute for object xi.
Data Matrix & Dissimilarity Matrix

The dissimilarity matrix stores, for every pair of the n objects, a proximity d(i, j): the measured dissimilarity or "difference" between objects i and j. In general, d(i, j) is a nonnegative number that is close to 0 when objects i and j are highly similar or "near" each other, and becomes larger the more they differ. Since d(i, j) = d(j, i) and d(i, i) = 0, it is typically stored as an n-by-n lower-triangular matrix.
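As a minimal sketch (in Python, with made-up example values), a dissimilarity matrix can be built from a data matrix by evaluating a distance function over all pairs of objects:

```python
import math

def euclidean(x, y):
    """Straight-line distance between two p-dimensional points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def dissimilarity_matrix(data, dist=euclidean):
    """Build the n-by-n matrix whose entry (i, j) holds d(i, j).

    The result is symmetric with a zero diagonal, so in practice only
    the lower triangle needs to be stored.
    """
    n = len(data)
    return [[dist(data[i], data[j]) for j in range(n)] for i in range(n)]

# Four objects described by p = 2 numeric attributes (hypothetical values).
X = [(1.0, 2.0), (3.0, 5.0), (2.0, 0.0), (4.0, 5.0)]
D = dissimilarity_matrix(X)
```

Swapping in a different `dist` function adapts the same construction to other attribute types, as the following slides do.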
Data Matrix & Dissimilarity Matrix
Similarity Matrix for Nominal Attributes
 A nominal attribute can take on two or more states.
 map_color is a nominal attribute that may have, say, five states: red, yellow, green, pink, and blue.
Similarity Matrix for Nominal Attributes

Since here we have one nominal attribute, test-1, we set p = 1. d(i, j ) evaluates to 0 if objects i
and j match, and 1 if the objects differ.

Following is the Dissimilarity Matrix


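The nominal case can be sketched as d(i, j) = (p − m) / p, where m is the number of matching attributes. Below is a small Python illustration; the attribute values for the four objects are hypothetical, chosen so that objects 1 and 4 match on test-1:

```python
def nominal_dissimilarity(obj_i, obj_j):
    """d(i, j) = (p - m) / p: p nominal attributes, m of which match."""
    p = len(obj_i)
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)
    return (p - m) / p

# One nominal attribute (p = 1) for four objects (hypothetical codes).
test_1 = [("code A",), ("code B",), ("code C",), ("code A",)]
n = len(test_1)
D = [[nominal_dissimilarity(test_1[i], test_1[j]) for j in range(n)]
     for i in range(n)]
```

With p = 1, each entry is simply 0 on a match and 1 on a mismatch, as the text states.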
Similarity Matrix for Binary Attributes
 A binary attribute has only two states, 0 and 1, where 0 means that the attribute is absent and 1
means that it is present.
 For the attribute smoker describing a patient, for instance, 1 indicates that the patient smokes, whereas 0
indicates that the patient does not.
 2 × 2 contingency table: q counts the attributes equal to 1 for both objects, r counts those that are 1 for
object i but 0 for object j, s counts those that are 0 for i but 1 for j, and t counts those that are 0 for both.

If the two states are equally important (symmetric binary): d(i, j) = (r + s) / (q + r + s + t).

If the two states are not equally important (asymmetric binary): d(i, j) = (r + s) / (q + r + s), i.e. the
number of negative matches t is considered unimportant and is ignored.
Similarity Matrix for Binary Attributes
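A sketch of both variants in Python, using the contingency counts q, r, s, t; the two patient vectors are hypothetical attribute profiles (1 = present, 0 = absent):

```python
def binary_dissimilarity(x, y, symmetric=True):
    """Binary dissimilarity from the 2 x 2 contingency counts."""
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)  # both 1
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)  # 1 in x only
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)  # 1 in y only
    t = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)  # both 0
    if symmetric:
        return (r + s) / (q + r + s + t)   # both states equally important
    return (r + s) / (q + r + s)           # negative matches t ignored

# Hypothetical patients over six binary attributes (e.g. symptoms).
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
```

For asymmetric attributes such as medical test results, dropping t keeps two patients from looking similar merely because they share many absent symptoms.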
Similarity Matrix for Numeric Attributes
 Normalization involves transforming the data to fall within a smaller or common range, such as [−1.0, 1.0] or [0.0, 1.0].
 Consider a height attribute, for example, which could be measured in either meters or inches.
 The most popular distance measure is Euclidean distance (i.e., straight line or "as the crow flies").
 Let i = (xi1, xi2, ..., xip) and j = (xj1, xj2, ..., xjp) be two objects described by p numeric attributes.
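A short Python sketch of both steps, with hypothetical height measurements: min–max normalization brings the attribute into a common range, and Euclidean distance then compares objects:

```python
import math

def min_max(values, new_min=0.0, new_max=1.0):
    """Map each value into [new_min, new_max] so that attributes measured
    on different scales (e.g. meters vs. inches) contribute comparably."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

def euclidean(i, j):
    """Straight-line distance between two p-dimensional objects."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(i, j)))

heights = [150, 160, 170, 180, 190]   # hypothetical values, in cm
scaled = min_max(heights)             # -> [0.0, 0.25, 0.5, 0.75, 1.0]
```

Without the normalization step, an attribute with a large raw range would dominate the distance computation.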
Similarity Matrix for Ordinal Attributes
 Ordinal attributes may also be obtained from the discretization of numeric attributes by splitting the value range
into a finite number of categories.
 The range of a numeric attribute can be mapped to an ordinal attribute f having Mf states.
 For example, the range of the interval-scaled attribute temperature (in Celsius) can be organized into the
following states: −30 to −10, −10 to 10, and 10 to 30, representing the categories cold temperature, moderate
temperature, and warm temperature, respectively.
 Let Mf be the number of states of the ordinal attribute, and rank the states numerically: 1, 2, ..., Mf.
 Suppose that f is an attribute from a set of ordinal attributes describing n objects.
 The rank rif of object i on f is normalized onto [0.0, 1.0] as:

zif = (rif − 1) / (Mf − 1)

 The dissimilarity between two objects can then be calculated on the zif values using, for example, Euclidean distance.
Similarity Matrix for Ordinal Attributes

There are three states for test-2: fair, good, and excellent, that is, Mf = 3.
For step 1, if we replace each value for test-2 by its rank, the four objects are assigned the ranks 3, 1, 2, and 3,
respectively.
Step 2 normalizes the ranking by mapping rank 1 to 0.0, rank 2 to 0.5, and rank 3 to 1.0.
For step 3, we can use, say, the Euclidean distance
The dissimilarity matrix (using d(i, j) = |zi − zj| for the single attribute) is then: d(2,1) = 1.0, d(3,1) = 0.5, d(3,2) = 0.5, d(4,1) = 0.0, d(4,2) = 1.0, d(4,3) = 0.5.
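The three steps above can be sketched in Python. The four test-2 values are taken to be excellent, fair, good, excellent, which reproduces the ranks 3, 1, 2, 3 stated in the text:

```python
def ordinal_to_numeric(values, order):
    """Replace each ordinal value by its rank r in 1..Mf, then normalize
    to z = (r - 1) / (Mf - 1) so it falls in [0.0, 1.0]."""
    rank = {state: r for r, state in enumerate(order, start=1)}
    mf = len(order)
    return [(rank[v] - 1) / (mf - 1) for v in values]

# test-2: Mf = 3 states, ranked fair < good < excellent.
test_2 = ["excellent", "fair", "good", "excellent"]
z = ordinal_to_numeric(test_2, order=["fair", "good", "excellent"])
# z is [1.0, 0.0, 0.5, 1.0]; e.g. d(2,1) = |z[1] - z[0]| = 1.0
```

After this mapping, the ordinal attribute is handled exactly like a numeric one.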
Similarity Matrix for Mixed Attributes
 The technique combines the different attributes into a single dissimilarity matrix, bringing all of the meaningful
attributes onto a common scale of the interval [0.0, 1.0].
 Suppose that the data set contains p attributes of mixed types. The dissimilarity d(i, j) between objects i and j is
defined as the weighted average d(i, j) = Σf δf · df(i, j) / Σf δf, where the indicator δf is 0 if attribute f is missing
for object i or j (and 1 otherwise), and df(i, j) is the per-attribute dissimilarity computed according to f's type.
Similarity Matrix for Mixed Attributes

Dissimilarity matrices are first computed separately for test-1 (nominal), test-2 (ordinal), and test-3 (numeric).

The overall dissimilarity d(3, 1) = [1·1 + 1·0.5 + 1·0.45] / 3 = 0.65.

The dissimilarity matrix for all attributes of mixed type follows by averaging in the same way for every pair of objects.
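The averaging step can be sketched in Python, using the per-attribute dissimilarities for objects 3 and 1 quoted in the text (1 for test-1, 0.5 for test-2, 0.45 for test-3):

```python
def mixed_dissimilarity(per_attribute_d, indicators=None):
    """d(i, j) = sum(delta_f * d_f) / sum(delta_f) over the p attributes,
    where delta_f = 0 when attribute f is missing for either object."""
    if indicators is None:
        indicators = [1] * len(per_attribute_d)
    num = sum(delta * d for delta, d in zip(indicators, per_attribute_d))
    den = sum(indicators)
    return num / den

# Per-attribute dissimilarities d_f(3, 1): nominal, ordinal, numeric.
d_31 = mixed_dissimilarity([1.0, 0.5, 0.45])   # -> 0.65
```

Setting an indicator to 0 lets a missing attribute drop out of both the numerator and the denominator, rather than biasing the average.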
Document Similarity-Cosine
 Cosine similarity is a measure of similarity that can be used to compare documents or, say, give a ranking of
documents with respect to a given vector of query words.
 Let x and y be two vectors; then sim(x, y) = (x · y) / (‖x‖ ‖y‖), where · is the dot product and ‖·‖ is the Euclidean norm.
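A direct Python sketch of the formula, applied to two hypothetical term-frequency vectors over a shared vocabulary:

```python
import math

def cosine_similarity(x, y):
    """sim(x, y) = (x . y) / (||x|| * ||y||); 1.0 means identical
    orientation, 0.0 means the vectors share no common terms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Hypothetical term-frequency vectors for two documents.
doc_x = [5, 0, 3, 0, 2]
doc_y = [3, 0, 2, 0, 1]
sim = cosine_similarity(doc_x, doc_y)
```

Because it depends only on the angle between the vectors, cosine similarity ranks a short document and a long document on the same topic as similar, which makes it a natural choice for ranking documents against a query vector.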
Data Cleaning
 Handling Missing Values: Imagine that you need to analyze the sales and customer data of a company. You note
that many tuples have no recorded value for several attributes such as customer income. How can you go about
filling in the missing values for this attribute?
 Ignore the tuple: usually done when the class label is missing; ineffective when many attributes per tuple are missing.
 Fill in the missing value manually: time-consuming, and often infeasible for large data sets.
 Use a global constant to fill in the missing value: e.g. a label such as "Unknown"; simple, but the mining program may mistake the constant for an interesting concept.
 Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the missing value.
 Use the attribute mean or median for all samples belonging to the same class as the given tuple.
 Use the most probable value to fill in the missing value: e.g. as predicted by regression, Bayesian inference, or a decision tree.

 Caution: in some cases, a missing value does not imply an error in the data! For example,
when applying for a credit card, candidates may be asked to supply their driver's license number. Candidates who
do not have a driver's license may naturally leave this field blank.
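The central-tendency strategy above can be sketched in a few lines of Python, with hypothetical customer-income values (None marks a missing entry):

```python
from statistics import mean, median

def fill_missing(values, strategy="mean"):
    """Replace None entries with a measure of central tendency computed
    from the observed values (mean for symmetric data, median when the
    distribution is skewed)."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]

# Hypothetical customer income (in thousands), with two gaps.
income = [56, None, 48, 64, None, 52]
filled = fill_missing(income, strategy="median")   # gaps become 54
```

A class-conditional variant would simply compute the fill value per class before applying the same replacement.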

 Handling Noise: Noise is a random error or variance in a measured variable.


 Binning: Binning methods smooth a sorted data value by consulting its “neighborhood,” that is, the values around it.
The sorted values are distributed into a number of “buckets,” or bins.
Because binning methods consult the neighborhood of values, they perform local smoothing.
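Smoothing by bin means can be sketched as follows; the price list is an illustrative example (equal-frequency bins of size 3):

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into equal-frequency bins, and replace
    every value in a bin by the bin mean (local smoothing)."""
    data = sorted(values)
    smoothed = []
    for start in range(0, len(data), bin_size):
        bin_ = data[start:start + bin_size]
        m = sum(bin_) / len(bin_)
        smoothed.extend([m] * len(bin_))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # sorted example data
smoothed = smooth_by_bin_means(prices, bin_size=3)
# -> [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```

Smoothing by bin boundaries works the same way, except each value is replaced by the closer of the bin's minimum and maximum instead of the mean.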
Data Cleaning
 Handling Noise: Noise is a random error or variance in a measured variable.
 Regression: Data smoothing can also be done by regression, a technique that conforms data values to a function.
Linear regression involves finding the “best” line to fit two attributes (or variables) so that one attribute can be used
to predict the other.
Multiple linear regression is an extension of linear regression, where more than two attributes are involved, and the
data are fit to a multidimensional surface.
 Outlier Analysis: Outliers may be detected by clustering, for example, where similar values are organized into
groups or “clusters.” Intuitively, values that fall outside of the set of clusters may be considered as outliers
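A minimal sketch of the cluster-based idea in Python: given cluster centers (e.g. produced by an earlier clustering step), a value whose distance to every center exceeds a threshold falls outside all clusters and is flagged. All values here are hypothetical:

```python
def flag_outliers(points, centers, max_dist):
    """Flag 1-D points whose distance to the nearest cluster center
    exceeds max_dist, i.e. points that fall outside every cluster."""
    return [p for p in points
            if min(abs(p - c) for c in centers) > max_dist]

values = [10, 11, 12, 50, 51, 95]   # hypothetical measurements
centers = [11, 50.5]                # e.g. from a prior clustering step
outliers = flag_outliers(values, centers, max_dist=5)   # -> [95]
```

In practice the centers and threshold would come from the clustering algorithm itself rather than being supplied by hand.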
Data Integration
The semantic heterogeneity and structure of data pose great challenges in data integration.
 Entity Identification Problem:
 Does customer_id in one database and cust_number in another refer to the same attribute?
 Data codes for pay_type may be "H" and "S" in one database but 1 and 2 in another.
 Solution – understanding of metadata: metadata for each attribute includes the name, meaning, data
type, range of values permitted for the attribute, and null rules for handling blank, zero, or null values.
 Special attention must be paid to the structure of the data. E.g., in one system a discount may be applied to
the whole order, whereas in another system it is applied to each individual line item within the order.

 Redundancy and Correlation Analysis for Dimensionality Reduction:


 Gender | Is there a difference between the male and female proportions? | H1: Yes, H0: No | Test: one-sample
proportion test, since there is only one categorical variable. | If p ≤ 0.05, reject H0.
 Gender & Age Group | Is there a difference between the male and female proportions across age groups? | H1: Yes,
H0: No | Test: chi-squared test, since there are two categorical variables. | If p ≤ 0.05, reject H0.
 Numeric feature such as Height | Test: t-test | one numeric variable.
 Two numeric variables | Test: correlation coefficient (−1 to +1).
 One numeric and one categorical variable | If the categorical variable has two categories, use a t-test; otherwise, use ANOVA.
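The chi-squared case above can be sketched from scratch in Python. The 2 × 2 contingency counts are hypothetical; with df = (2−1)(2−1) = 1, the 0.05 critical value is 3.841, so a larger statistic rejects H0 (independence), suggesting the two attributes are correlated and one may be redundant:

```python
def chi_squared(table):
    """Pearson chi-squared statistic for an r x c contingency table:
    sum over cells of (observed - expected)^2 / expected, where
    expected = row_total * column_total / grand_total."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / n
            stat += (obs - exp) ** 2 / exp
    return stat

# Hypothetical gender x preference counts.
table = [[250, 200], [50, 1000]]
stat = chi_squared(table)   # far above 3.841 -> reject H0
```

In practice a library routine such as scipy's chi-squared contingency test would also return the p-value directly; the hand computation here just shows what the statistic measures.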

 Tuple Duplication: a common problem; solution: maintain integrity constraints while inserting or
updating data across different data structures.

 Data value conflict detection and resolution: Solution: concept hierarchy, normalization.
Session Outcomes
In this session you learned about:
 Data Matrix and Dissimilarity Matrix in terms of Similarity
Measure.

 The steps involved & techniques to Clean Data.

 Data Integration Problems & Solution Approach.


Thank You
