Predictive Modelling
Data Preprocessing
• Data Preprocessing
• Data Quality
• Tasks in Data Preprocessing
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation and Data Discretization
Accuracy: correct or wrong, accurate or not
Noisy Data
• data entry problems
• data transmission problems
• technology limitations
• inconsistent naming conventions
• duplicate records
• incomplete data
• inconsistent data
Binning
• first sort data and partition into (equal-frequency) bins
• then one can smooth by bin means, smooth by bin
median, smooth by bin boundaries, etc.
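The binning steps above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names and the sample `prices` list are made up for the example, and ties in boundary smoothing are assumed to go to the lower boundary.

```python
from statistics import mean, median

def equal_frequency_bins(values, n_bins):
    """Sort the values, then split them into n_bins bins of (nearly) equal size."""
    data = sorted(values)
    size, extra = divmod(len(data), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)
        bins.append(data[start:end])
        start = end
    return bins

def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_medians(bins):
    """Replace every value in a bin with the bin median."""
    return [[median(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are the boundaries
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

# Illustrative data only.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_frequency_bins(prices, 3)
print(bins)
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```

The sketch assumes there are at least as many values as bins, since an empty bin has no mean or median.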
Regression
• smooth by fitting the data into regression functions
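One way to read "smooth by fitting the data into regression functions" is to fit a straight line by least squares and replace each observed value with the fitted one. The sketch below assumes a single predictor and a linear model; `fit_line`, `smooth_by_regression`, and the sample values are illustrative names and data, not part of the slides.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # assumes xs are not all identical
    intercept = mean_y - slope * mean_x
    return slope, intercept

def smooth_by_regression(xs, ys):
    """Replace each noisy y with the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]

# Illustrative noisy measurements.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
print(smooth_by_regression(xs, ys))
```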
Data Cleaning as a Process
• Use commercial tools
• Data scrubbing: use simple domain knowledge (e.g., postal
code, spell-check) to detect errors and make corrections
• Data auditing: analyze the data to discover rules and
relationships and detect violators (e.g., correlation and clustering
to find outliers)
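A small sketch of both ideas, under stated assumptions. For scrubbing, it assumes a hypothetical five-digit postal code format and a hypothetical `postal_code` field. For auditing, it swaps in a simple z-score rule in place of the correlation or clustering methods named above, purely to keep the example short.

```python
import re
from statistics import mean, stdev

# Assumption: postal codes are exactly five digits. Adjust for the real format.
POSTAL_CODE = re.compile(r"\d{5}")

def find_invalid_postal_codes(records):
    """Scrubbing: return indices of records whose postal code breaks the rule."""
    return [
        i for i, rec in enumerate(records)
        if not POSTAL_CODE.fullmatch(str(rec.get("postal_code", "")).strip())
    ]

def find_outliers(values, z_threshold=2.5):
    """Auditing (simplified): flag values far from the mean in standard deviations."""
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > z_threshold]

# Illustrative records and measurements.
records = [{"postal_code": "12345"}, {"postal_code": "1234A"}, {}]
print(find_invalid_postal_codes(records))
print(find_outliers([10, 11, 9, 10, 12, 10, 11, 9, 10, 100]))
```

The threshold is a judgment call: with only ten values, no point can be more than about 2.85 sample standard deviations from the mean, so a cutoff of 3 would never fire on a sample this small.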
Data Integration
• Identify real world entities from multiple data
sources
• Detecting and resolving data value conflicts
• For the same real world entity, attribute values
from different sources are different
• Possible reasons: different representations,
different scales, e.g., metric vs. British units
• Redundant data often occur when multiple databases are
integrated
• Object identification: the same attribute or
object may have different names in different
databases
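The naming and unit conflicts described above can be sketched as a mapping onto one common schema. All names here (`ALIASES`, `to_common_schema`, the `weight` field, the two sample sources) are hypothetical and exist only for the illustration.

```python
LB_TO_KG = 0.45359237  # exact definition of the avoirdupois pound in kilograms

# Hypothetical aliases: different sources name the same attribute differently.
ALIASES = {"cust_id": "customer_id", "customer_number": "customer_id"}

def to_common_schema(record, aliases, weight_unit="kg"):
    """Rename attributes to shared names and convert weights to kilograms."""
    unified = {aliases.get(name, name): value for name, value in record.items()}
    if "weight" in unified:
        weight = unified.pop("weight")
        unified["weight_kg"] = weight * LB_TO_KG if weight_unit == "lb" else weight
    return unified

# Two illustrative sources describing the same real-world entity.
source_a = {"cust_id": 7, "weight": 154.0}           # pounds
source_b = {"customer_number": 7, "weight": 69.85}   # kilograms

print(to_common_schema(source_a, ALIASES, weight_unit="lb"))
print(to_common_schema(source_b, ALIASES, weight_unit="kg"))
```

After normalization the two records share attribute names and units, so any remaining difference in values is a real conflict to resolve rather than a representation artifact.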
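Correlation Analysis (Nominal Data)
• χ² (chi-square) test: $\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$
• The larger the χ² value, the more likely the variables are related
• The cells that contribute the most to the χ² value are those whose actual count is
very different from the expected count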
• Correlation does not imply causality
• The number of hospitals and the number of car thefts in a city are correlated
• Both are causally linked to a third variable: population
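As a sketch of how the χ² value is computed from a contingency table: each expected count is (row total × column total) / grand total, and each cell contributes (observed − expected)² / expected. The function name and the counts below are illustrative only.

```python
def chi_square(observed):
    """Return (chi2, expected) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    chi2 = 0.0
    expected = []
    for i, row in enumerate(observed):
        expected_row = []
        for j, count in enumerate(row):
            e = row_totals[i] * col_totals[j] / total
            expected_row.append(e)
            chi2 += (count - e) ** 2 / e  # assumes no expected count is zero
        expected.append(expected_row)
    return chi2, expected

# Illustrative 2x2 table of counts for two nominal attributes.
table = [[250, 200],
         [50, 1000]]
chi2, expected = chi_square(table)
print(round(chi2, 2), expected)
```

Comparing `table` with `expected` cell by cell shows which cells drive the statistic, as noted in the bullet above.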
Correlation Analysis (Numeric Data)
• Correlation coefficient (also called Pearson’s product moment coefficient)
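$$r_{A,B} = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} a_i b_i - n\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}$$
where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, σA and
σB are the respective standard deviations of A and B, and Σ(aibi) is the sum of the AB
cross-product.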
• If rA,B > 0, A and B are positively correlated (A’s values increase as B’s do). The
higher the value, the stronger the correlation.
• rA,B = 0: no linear correlation (this alone does not establish independence); rA,B < 0: negatively correlated
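A minimal sketch of the correlation coefficient, assuming σA and σB are sample standard deviations (the n − 1 form, as computed by `statistics.stdev`). Both the centered form and the cross-product shortcut are shown; the function names and data are illustrative.

```python
from statistics import mean, stdev

def pearson_r(a, b):
    """Centered form: sum((a_i - mean_a)(b_i - mean_b)) / ((n - 1) * sd_a * sd_b)."""
    n = len(a)
    mean_a, mean_b = mean(a), mean(b)
    cross = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    return cross / ((n - 1) * stdev(a) * stdev(b))

def pearson_r_shortcut(a, b):
    """Shortcut form: (sum(a_i * b_i) - n * mean_a * mean_b) / ((n - 1) * sd_a * sd_b)."""
    n = len(a)
    cross = sum(x * y for x, y in zip(a, b)) - n * mean(a) * mean(b)
    return cross / ((n - 1) * stdev(a) * stdev(b))

# Illustrative data: b rises with a, c falls with a.
a = [1, 2, 3, 4, 5]
b = [2, 4, 6, 8, 10]
c = [10, 8, 6, 4, 2]
print(pearson_r(a, b), pearson_r(a, c))
```

Both functions assume neither attribute is constant, since a zero standard deviation makes the coefficient undefined.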
Covariance (Numeric Data)
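• Covariance is similar to correlation: $\mathrm{Cov}(A,B) = E[(A - \bar{A})(B - \bar{B})]$
• Correlation coefficient: $r_{A,B} = \frac{\mathrm{Cov}(A,B)}{\sigma_A \sigma_B}$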
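The link between the two measures can be sketched as follows: covariance is the mean product of deviations from the means, and dividing it by both standard deviations gives the correlation coefficient. This sketch assumes the population form (dividing by n) for both covariance and standard deviation; using n − 1 consistently in both gives the same correlation. Names and data are illustrative.

```python
from statistics import mean, pstdev

def covariance(a, b):
    """Population covariance: mean of (a_i - mean_a) * (b_i - mean_b)."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)

def correlation_from_covariance(a, b):
    """Correlation coefficient as covariance scaled by both standard deviations."""
    return covariance(a, b) / (pstdev(a) * pstdev(b))

# Illustrative values for two numeric attributes.
stock_a = [2, 3, 5, 4, 6]
stock_b = [5, 8, 10, 11, 14]
print(covariance(stock_a, stock_b))
print(correlation_from_covariance(stock_a, stock_b))
```

A positive covariance means the two attributes tend to rise together, but its magnitude depends on their units; the correlation coefficient removes that dependence.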
Data Reduction
• Data reduction strategies
• Dimensionality reduction, e.g., remove unimportant
attributes
Data Reduction: Dimensionality Reduction
Dimensionality reduction
• Avoid the curse of dimensionality
• Help eliminate irrelevant features and reduce noise
• Reduce time and space required in data mining
• Allow easier visualization
Dimensionality reduction techniques
• Wavelet transforms
• Principal Component Analysis
• Supervised and nonlinear techniques (e.g., feature selection)
Data Reduction: Principal Component Analysis
• Find a projection that captures the largest amount of variation in data
• The original data are projected onto a much smaller space, resulting in
dimensionality reduction. We find the eigenvectors of the covariance matrix,
and these eigenvectors define the new space
[Figure: data points plotted on axes x1 and x2]
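The procedure described above (center the data, take the eigenvectors of the covariance matrix, project onto the leading ones) can be sketched with NumPy. This is a bare-bones illustration that assumes NumPy is installed; the function name `pca` and the sample matrix are made up for the example, and no scaling of attributes is performed.

```python
import numpy as np

def pca(X, n_components):
    """Project X (rows = tuples, columns = attributes) onto its top principal components."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    cov = np.cov(centered, rowvar=False)      # covariance matrix of the attributes
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigh: for symmetric matrices, ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]            # columns define the new space
    return centered @ components, components, eigvals[order]

# Illustrative 2-D data lying close to a line, reduced to 1 dimension.
X = [[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]]
projected, components, variances = pca(X, n_components=1)
print(projected.shape, variances)
```

The sign of each eigenvector is arbitrary, so projected coordinates may be flipped between runs or libraries without changing the captured variance.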
Data Reduction: Attribute Subset Selection
Another way to reduce dimensionality of data
Redundant attributes
• Duplicate much or all of the information contained in
one or more other attributes
Irrelevant attributes
• Contain no information that is useful for the data
mining task at hand
• E.g., students' ID is often irrelevant to the task of
predicting students' GPA
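One simple way to spot redundant numeric attributes is to drop any attribute that is almost perfectly correlated with one already kept. The sketch below is only a heuristic under stated assumptions: the 0.95 threshold, the function names, and the sample columns are all illustrative, and irrelevant attributes such as an ID still have to be identified from domain knowledge.

```python
def pearson(a, b):
    """Pearson correlation of two equal-length numeric lists (non-constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    saa = sum((x - mean_a) ** 2 for x in a)
    sbb = sum((y - mean_b) ** 2 for y in b)
    return sab / (saa * sbb) ** 0.5

def drop_redundant(columns, threshold=0.95):
    """Keep an attribute only if it is not highly correlated with one already kept."""
    kept, dropped = [], []
    for name, values in columns.items():
        if any(abs(pearson(values, columns[k])) >= threshold for k in kept):
            dropped.append(name)
        else:
            kept.append(name)
    return kept, dropped

# Illustrative table: height is recorded twice in different units.
columns = {
    "height_cm": [150, 160, 170, 180],
    "height_in": [59.06, 62.99, 66.93, 70.87],
    "gpa": [3.1, 2.4, 3.8, 2.9],
}
print(drop_redundant(columns))
```

Which attribute of a correlated pair survives depends on column order here; a real selection step would usually prefer the attribute that is cleaner or more meaningful for the task.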
Thank You