Data Preprocessing Before Classification: Presented by

Data preprocessing involves collecting data and preparing it, which includes handling missing data, categorical data, inconsistent data, and outliers. The goals of preprocessing are to reduce noise, enhance the signal, reduce the input space (through techniques such as principal component analysis and eliminating correlated variables), extract features, and normalize the data. Common normalization techniques include min-max normalization and z-score normalization.

Uploaded by

A.J Khan

Data preprocessing before classification
Presented By:
Outline
• Collecting data
• Preparing data
• Data preprocessing
Collecting data
Collecting data
• Collecting “example
patterns”
– Inputs (vectors of
independent variables)
– Outputs (vectors of
dependent variables)
• More data is better
• Begin with an
elementary set of
data
Collecting data
• Choose an appropriate sampling rate for
time-series data.
• Make sure the measurement units of the data
are consistent.
• Keep non-essential variables out of the
input vector.
• Make sure no major structural (systemic)
changes have occurred during collection.
Collecting data
• How much data is enough?
– Training and testing using a subset of data
– If performance does not improve when the full
data set is used, the data is sufficient
– There are statistical validating methods (Ch.11)
• Using simulated data
– When it is difficult to collect (sufficient) data
• Realistic
• Representative
Preparing data
Preparing data
• Handling
– Missing data
– Categorical data
– Inconsistent data and outliers
Missing data
• Discard incomplete example patterns
• Manually enter a reasonable, probable, or
expected value
• Use a statistic computed from the example
patterns that have that value
– Mean, mode
• Encode missing values explicitly by creating new
indicator variables
• Build a predictive model to predict each of
the missing data values
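Two of the options above (filling in a statistic and adding an indicator variable) can be combined. The following is a minimal sketch, assuming missing entries are stored as None; the function name and the sample ages are hypothetical.

```python
from statistics import mean

def impute_with_indicator(column):
    """Fill None entries with the mean of the observed values and
    return a parallel 0/1 indicator marking which entries were missing."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)  # mode could be used instead for discrete data
    filled = [fill if v is None else v for v in column]
    indicator = [1 if v is None else 0 for v in column]
    return filled, indicator

ages = [20.0, None, 40.0, 30.0, None]  # hypothetical input column
filled, was_missing = impute_with_indicator(ages)
print(filled)       # [20.0, 30.0, 40.0, 30.0, 30.0]
print(was_missing)  # [0, 1, 0, 0, 1]
```

Keeping the indicator lets a model learn whether the fact that a value was missing is itself informative.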
Categorical data
• Ordinal:
– Convert to a numerical representation in a
straightforward manner
– “Low”, “medium”, “high” => 0, 1, 2
• Nominal:
– “One of n” representation
– Encode the input variables as n different
binary inputs, when there are n distinct
categories.
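Both encodings can be sketched in a few lines. This is an illustration only: the function names are hypothetical, and the column order of the one-of-n encoding (alphabetical here) is an arbitrary choice.

```python
def encode_ordinal(values, order):
    """Map ordered categories to 0, 1, 2, ... following the given order."""
    rank = {level: i for i, level in enumerate(order)}
    return [rank[v] for v in values]

def encode_one_of_n(values):
    """Encode a nominal variable as n binary inputs, one per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

levels = encode_ordinal(["low", "high", "medium"], ["low", "medium", "high"])
print(levels)  # [0, 2, 1]

categories, rows = encode_one_of_n(["red", "green", "blue", "green"])
print(categories)  # ['blue', 'green', 'red']
print(rows)        # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```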
Further process of “one of n”
• When n is too large, reduce the number of
inputs in the new encoding.
– Manually
– PCA-based reduction
• Reduce the one-of-n representation to a one-of-m
representation where m is less than n.
– Eigenvalue-based reduction
– Output variable-based reduction
Inconsistent data and outliers
• Removing erroneous data
• Identifying inconsistent data
– Thresholding, filtering
• Outliers
– Data points that lie outside of the normal
region of interest in the input space, which
may be
• Unusual situations that are “correct”
• Misleading or incorrect measurements
Outliers
• Ways to spot outliers
– Plot: box plot, histogram…
– Number of S.D. from the mean
• Handling outliers
– Remove them
• Assumption: the region of the input space where the outliers
reside is not of interest
– “Winsorize” them
• Convert the values of outliers into the values of upper or
lower thresholds.
• Outliers can always be reintroduced once a
satisfactory model is found, to study how they
change the performance of the model.
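Spotting outliers by their distance from the mean and winsorizing them might look like the sketch below. The cutoff of 2 standard deviations and the thresholds 8 and 13 are assumptions chosen for this toy data, not values from the slides.

```python
from statistics import mean, pstdev

def find_outliers(values, max_sd=2.0):
    """Return the values lying more than max_sd standard deviations from the mean."""
    mu, sd = mean(values), pstdev(values)
    return [v for v in values if sd > 0 and abs(v - mu) / sd > max_sd]

def winsorize(values, lower, upper):
    """Replace values below lower or above upper with the threshold itself."""
    return [min(max(v, lower), upper) for v in values]

readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]  # hypothetical measurements
flagged = find_outliers(readings)
clipped = winsorize(readings, 8, 13)
print(flagged)  # [50]
print(clipped)  # [10, 11, 9, 10, 12, 10, 11, 9, 10, 13]
```

Note that a large outlier inflates the standard deviation itself, so on small samples a strict cutoff such as 3 S.D. may fail to flag it.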
Ben Shabad
Data preprocessing
Reasons to preprocess data
• Reducing noise
• Enhancing the signal
• Reducing input space
• Feature extraction
• Normalizing data
• Modifying prior probabilities (specific for
classification)
Reducing noise
• Averaging data values
• Thresholding data
– Convert numeric format data into categorical
– E.g. grey-scale => monochrome (black-and-white) image
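Both noise-reduction ideas can be illustrated briefly. This is a sketch under assumptions: the window of 3 and the cutoff of 128 are arbitrary, and the function names are hypothetical.

```python
def moving_average(values, window=3):
    """Average each run of `window` consecutive values to smooth out spikes."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

def threshold(values, cutoff):
    """Convert numeric values into two categories: 1 at or above cutoff, else 0."""
    return [1 if v >= cutoff else 0 for v in values]

signal = [1.0, 2.0, 9.0, 2.0, 1.0]      # one noisy spike in the middle
smoothed = moving_average(signal)        # spike of 9.0 is damped to about 4.33

pixels = [12, 200, 130, 90]              # hypothetical grey-scale intensities
binary = threshold(pixels, 128)
print(binary)  # [0, 1, 1, 0]
```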
Reducing input space
• Principal component analysis (PCA)
– Identify m-dimensional subspace of the n-dimensional
input space
– The original n variables are reduced to m variables that are
mutually orthogonal (uncorrelated)
• Eliminating correlated input variables
– Identify highly correlated input variables by
• Statistical correlation tests
• Visual inspection of graphed data variables
• Seeing if a data variable can be modeled using one or more
others.
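Eliminating correlated inputs with a statistical correlation measure could be sketched as follows. The Pearson coefficient, the 0.95 limit, and the greedy keep-the-first rule are assumptions made for illustration; the sketch also assumes no column is constant.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def drop_correlated(columns, limit=0.95):
    """Keep a variable only if it is not highly correlated with one already kept."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < limit for k in kept):
            kept.append(name)
    return kept

columns = {  # hypothetical input variables
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "height_in": [59.1, 63.0, 66.9, 70.9],
    "age": [30.0, 22.0, 41.0, 35.0],
}
kept = drop_correlated(columns)
print(kept)  # ['height_cm', 'age']
```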
Reducing input space
• Combining non-correlated input variables
• Sensitivity analysis
– If variations of a particular input variable
cause large changes in the estimation model
output, the variable is very significant.
– Sensitivity analysis prunes input variables
based on information provided by both input
and output data.
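One simple way to approximate sensitivity analysis is to nudge each input in turn and measure how much the model output moves. The sketch below assumes a model that is a plain function of a list of inputs; toy_model and the step size are hypothetical.

```python
def sensitivity(model, baseline, delta=0.01):
    """Estimate the output change per unit change of each input around baseline."""
    base_out = model(baseline)
    scores = []
    for i in range(len(baseline)):
        nudged = list(baseline)
        nudged[i] += delta  # vary one input, hold the others fixed
        scores.append(abs(model(nudged) - base_out) / delta)
    return scores

def toy_model(x):
    # Hypothetical estimation model: strongly driven by x[0], barely by x[1].
    return 5.0 * x[0] + 0.01 * x[1]

scores = sensitivity(toy_model, [1.0, 1.0])
# scores is approximately [5.0, 0.01], so x[1] is a candidate for pruning
```

For a nonlinear model the scores depend on the chosen baseline, so in practice they would be averaged over many example patterns.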
Normalizing data
• Not “transform to normal distribution”
• Some models perform better on normalized data
– Distance-based non-parametric algorithms implicitly
assume that distances in different directions carry the
same weight (e.g. K-nearest neighbor, “KNN”)
– Backpropagation (BP) and multilayer
perceptron (MLP) models often perform better
if all inputs and outputs are normalized
• Avoiding numerical problems
Types of normalization
• Min-max normalization
– It preserves all relationships of the data
values exactly
– It compresses the normal range into a narrow
band if extreme values or outliers exist
• Z-score normalization
• Sigmoidal normalization
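The three normalizations can be sketched as below. The slides do not give formulas, so these are common textbook forms: in particular, the sigmoidal version here (a logistic function applied to the z-score) is an assumption, and other variants such as tanh exist. The sketch assumes the data is not constant.

```python
from math import exp
from statistics import mean, pstdev

def min_max(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]

def sigmoidal(values):
    """Squash z-scores into (0, 1) with a logistic function (assumed form)."""
    return [1.0 / (1.0 + exp(-z)) for z in z_score(values)]

data = [2.0, 4.0, 6.0, 8.0, 100.0]  # hypothetical values with one outlier
scaled = min_max(data)
standardized = z_score(data)
squashed = sigmoidal(data)
print(scaled[0], scaled[-1])  # 0.0 1.0
```

With the outlier 100.0 present, min-max scaling places the first four values between 0 and about 0.06, which is the compression effect noted above.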
Other considerations
• According to the characteristics of the
specific classifiers being used for modeling
– E.g. CHAID uses categorical data directly
• Input variables produce the best modeling
accuracy when exhibiting a uniform or
Gaussian distribution
• Add expert knowledge when preprocessing
data
Get prepared and then go!
