
Data Mining and Predictive Modelling
Data Preprocessing
• Data Preprocessing
• Data Quality
• Tasks in Data Preprocessing
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation and Data Discretization
Data Quality (Measures)
• Accuracy: is the data correct or wrong, accurate or not?
• Completeness: is anything not recorded, unavailable, …?
• Consistency: some values modified but others not, dangling references, …
• Timeliness: is the data updated in a timely way?
• Believability: how far can the data be trusted to be correct?
• Interpretability: how easily can the data be understood?


Tasks in Data Preprocessing
• Data cleaning
  • Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
  • Integration of multiple databases, data cubes, or files
• Data reduction
  • Dimensionality reduction
  • Numerosity reduction
  • Data compression
• Data transformation and data discretization
  • Normalization
  • Concept hierarchy generation
Data Cleaning

• Real-world data is often dirty. It can be:
  • Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    • E.g., Profession = “ ” (missing data)
  • Noisy: containing noise, errors, or outliers
    • E.g., salary = “-10” (an error)
  • Inconsistent: containing discrepancies in codes or names
    • E.g., discrepancy between duplicate records
    • E.g., ratings previously coded “1, 2, 3” are now coded “A, B, C”
  • Intentional (e.g., disguised missing data)
    • E.g., Jan. 1 as everyone’s birthday?
Incomplete (Missing) Data

Data is not always available
• E.g., many tuples have no recorded value for several attributes, such as customer income in sales data

Missing data may be due to
• Equipment malfunction
• Values inconsistent with other recorded data and thus deleted
• Data not entered due to misunderstanding
• Certain data not being considered important at the time of entry
• History or changes of the data not being registered

Missing data may need to be inferred
Handling missing data
• Ignore the tuple: usually done when the class label is missing (when doing classification); not effective when the percentage of missing values per attribute varies considerably
• Fill in the missing value manually: tedious and often infeasible
• Fill it in automatically with
  • a global constant, e.g., “unknown” (which may act as a new class)
  • the attribute mean
  • the attribute mean for all samples belonging to the same class: smarter
  • the most probable value: inference-based, e.g., using a Bayesian formula or a decision tree
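
The two mean-based fill-in strategies above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the attribute name `income`, and the class label `cls` are hypothetical toy data, and every class is assumed to have at least one observed value.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy records; None marks a missing income value.
records = [
    {"cls": "low", "income": 20.0},
    {"cls": "low", "income": None},
    {"cls": "low", "income": 30.0},
    {"cls": "high", "income": 90.0},
    {"cls": "high", "income": None},
    {"cls": "high", "income": 110.0},
]

def fill_with_mean(rows, attr):
    """Replace missing values of `attr` with the mean of all observed values."""
    overall = mean(r[attr] for r in rows if r[attr] is not None)
    return [{**r, attr: overall if r[attr] is None else r[attr]} for r in rows]

def fill_with_class_mean(rows, attr, label):
    """Replace missing values of `attr` with the mean observed within the same class."""
    by_class = defaultdict(list)
    for r in rows:
        if r[attr] is not None:
            by_class[r[label]].append(r[attr])
    class_mean = {c: mean(vals) for c, vals in by_class.items()}
    return [
        {**r, attr: class_mean[r[label]] if r[attr] is None else r[attr]}
        for r in rows
    ]

global_filled = fill_with_mean(records, "income")
class_filled = fill_with_class_mean(records, "income", "cls")
```

In this toy data the overall mean sits between the two groups, while the class-wise mean stays inside each group, which is why the slide calls the class-wise variant smarter.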
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
  • faulty data collection instruments
  • data entry problems
  • data transmission problems
  • technology limitations
  • inconsistency in naming conventions
• Other data problems that require data cleaning
  • duplicate records
  • incomplete data
  • inconsistent data
Handling Noisy Data
• Binning
  • first sort the data and partition it into (equal-frequency) bins
  • then smooth by bin means, bin medians, bin boundaries, etc.
• Regression
  • smooth by fitting the data to regression functions
• Clustering
  • detect and remove outliers
• Combined computer and human inspection
  • detect suspicious values and have a human check them (e.g., deal with possible outliers)
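
As a minimal sketch of the binning approach above, the following Python code partitions sorted values into equal-frequency bins and smooths them by bin means and by bin boundaries. The price list is illustrative data, not taken from these slides. The sketch assumes the number of values divides evenly into the bins, and it sends ties to the lower boundary.

```python
from statistics import mean

def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of equal size."""
    ordered = sorted(values)
    if len(ordered) % n_bins != 0:
        raise ValueError("this sketch assumes len(values) is a multiple of n_bins")
    size = len(ordered) // n_bins
    return [ordered[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's min and max (ties go low)."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are the boundaries
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

# Illustrative data (assumption).
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```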
Data Cleaning as a Process
• Data discrepancy detection
  • Use metadata (e.g., domain, range, dependency, distribution)
  • Check field overloading
  • Check the uniqueness rule, consecutive rule, and null rule
  • Use commercial tools
    • Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections
    • Data auditing: analyze the data to discover rules and relationships and to detect violators (e.g., correlation and clustering to find outliers)
• Data migration and integration
  • Data migration tools: allow transformations to be specified
  • ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface
• Integration of the two processes
  • Iterative and interactive
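
Two of the discrepancy checks listed above, the uniqueness rule and the null rule, can be illustrated with a short Python sketch. The rows, the attribute names, and the set of null markers are hypothetical. A real rule set would come from the metadata of the data source.

```python
from collections import Counter

# Hypothetical customer rows with one duplicated key and two null-like values.
rows = [
    {"cust_id": 1, "postal_code": "12345"},
    {"cust_id": 2, "postal_code": ""},
    {"cust_id": 2, "postal_code": "98765"},
    {"cust_id": 4, "postal_code": None},
]

def uniqueness_violations(rows, attr):
    """Return the values of `attr` that occur more than once."""
    counts = Counter(r[attr] for r in rows)
    return sorted(value for value, count in counts.items() if count > 1)

def null_violations(rows, attr, null_markers=(None, "", "?")):
    """Return the row indexes whose `attr` holds one of the assumed null markers."""
    return [i for i, r in enumerate(rows) if r[attr] in null_markers]

duplicate_ids = uniqueness_violations(rows, "cust_id")
missing_postal = null_violations(rows, "postal_code")
```

Rows flagged this way would then go to the scrubbing or human-inspection step, not be corrected automatically.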
Data Integration
• Data integration
  • Combines data from multiple sources into a coherent store
• Schema integration: e.g., A.cust-id ≡ B.cust-#
  • Integrate metadata from different sources
• Entity identification problem
  • Identify real-world entities from multiple data sources
• Detecting and resolving data value conflicts
  • For the same real-world entity, attribute values from different sources are different
  • Possible reasons: different representations, different scales, e.g., metric vs. British units
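
The schema-integration and value-conflict points above can be made concrete with a small Python sketch. Everything named here is an assumption for illustration: source A keys customers by `cust_id` and stores height in centimetres, source B uses `cust_no` and inches, and a conflict is any difference above a chosen tolerance. On a conflict, the sketch keeps A's value and reports the key.

```python
CM_PER_INCH = 2.54

# Hypothetical sources with different key names and different units.
source_a = [
    {"cust_id": 1, "height_cm": 180.0},
    {"cust_id": 2, "height_cm": 165.0},
]
source_b = [
    {"cust_no": 1, "height_in": 60.0},
    {"cust_no": 2, "height_in": 65.0},
    {"cust_no": 3, "height_in": 70.0},
]

def normalize_b(row):
    """Map B's schema onto A's: rename the key and convert inches to centimetres."""
    return {"cust_id": row["cust_no"], "height_cm": row["height_in"] * CM_PER_INCH}

def integrate(rows_a, rows_b, tolerance_cm=1.0):
    """Merge both sources by customer key and list the keys whose values conflict."""
    merged = {r["cust_id"]: dict(r) for r in rows_a}
    conflicts = []
    for r in map(normalize_b, rows_b):
        key = r["cust_id"]
        if key not in merged:
            merged[key] = r
        elif abs(merged[key]["height_cm"] - r["height_cm"]) > tolerance_cm:
            conflicts.append(key)  # A's value is kept; the conflict is only reported
    return merged, conflicts

merged, conflicts = integrate(source_a, source_b)
```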
Handling Redundancy in Data Integration
• Redundant data often occur when multiple databases are integrated
  • Object identification: the same attribute or object may have different names in different databases
  • Derivable data: one attribute may be a “derived” attribute in another table, e.g., annual revenue
• Redundant attributes can often be detected by correlation analysis and covariance analysis
• Careful integration of the data from multiple sources may help reduce or avoid redundancies and inconsistencies and improve mining speed and quality
Correlation Analysis (Nominal Data)
• χ² (chi-square) test

$$ \chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}} $$

• The larger the χ² value, the more likely the variables are related
• The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count
• Correlation does not imply causality
  • The number of hospitals and the number of car thefts in a city are correlated
  • Both are causally linked to a third variable: population
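
A minimal sketch of the χ² statistic for a contingency table follows. The expected count of each cell is taken as (row total × column total) / grand total, which is the usual choice when testing independence. The 2×2 counts are illustrative and not taken from these slides.

```python
def chi_square(observed):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Expected count if the two attributes were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (count - expected) ** 2 / expected
    return statistic

# Illustrative counts (assumption): rows and columns are two nominal attributes.
table = [
    [250, 200],
    [50, 1000],
]
chi2 = chi_square(table)
```

Here the top-left cell (250 observed against 90 expected) contributes the most to the statistic, matching the remark above about cells whose actual count differs strongly from the expected count.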
Correlation Analysis (Numeric Data)
• Correlation coefficient (also called Pearson’s product-moment coefficient)

$$ r_{A,B} = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} (a_i b_i) - n\,\bar{A}\,\bar{B}}{(n-1)\,\sigma_A \sigma_B} $$

where n is the number of tuples, \bar{A} and \bar{B} are the respective means of A and B, σA and σB are the respective standard deviations of A and B, and Σ(a_i b_i) is the sum of the AB cross-product.

• If r_{A,B} > 0, A and B are positively correlated (A’s values increase as B’s do). The higher the value, the stronger the correlation.
• r_{A,B} = 0: no linear correlation (this alone does not establish independence); r_{A,B} < 0: negatively correlated
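
The two equivalent forms of the correlation coefficient can be checked numerically. The sketch below assumes σA and σB are sample standard deviations (divisor n − 1), which is what the (n − 1) factor in the denominator calls for. The data values are made up for illustration.

```python
from math import fsum
from statistics import mean, stdev

def pearson_centered(a, b):
    """r computed from the sum of products of deviations from the means."""
    n = len(a)
    mean_a, mean_b = mean(a), mean(b)
    numerator = fsum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    return numerator / ((n - 1) * stdev(a) * stdev(b))

def pearson_cross_product(a, b):
    """r computed from the cross-product sum minus n times the product of the means."""
    n = len(a)
    numerator = fsum(x * y for x, y in zip(a, b)) - n * mean(a) * mean(b)
    return numerator / ((n - 1) * stdev(a) * stdev(b))

# Made-up attribute values (assumption).
A = [1.0, 2.0, 3.0, 4.0, 5.0]
B = [2.0, 4.0, 5.0, 4.0, 5.0]
r_centered = pearson_centered(A, B)
r_cross = pearson_cross_product(A, B)
```

Both functions agree up to floating-point rounding. The centered form is usually the numerically safer of the two when the means are large relative to the spread.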
Covariance (Numeric Data)
• Covariance is similar to correlation

$$ \mathrm{Cov}(A,B) = E\big[(A - \bar{A})(B - \bar{B})\big] \approx \frac{1}{n-1} \sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B}) $$

Correlation coefficient:

$$ r_{A,B} = \frac{\mathrm{Cov}(A,B)}{\sigma_A \sigma_B} $$

where n is the number of tuples, \bar{A} and \bar{B} are the respective mean or expected values of A and B, and σA and σB are the respective standard deviations of A and B. The divisor n − 1 is chosen so that the ratio matches the (n − 1)σAσB denominator used for the correlation coefficient.
• Positive covariance: if Cov(A,B) > 0, then A and B tend to be larger than their expected values together.
• Negative covariance: if Cov(A,B) < 0, then when A is larger than its expected value, B is likely to be smaller than its expected value.
• Independence: if A and B are independent, then Cov(A,B) = 0, but the converse is not true:
  • Some pairs of random variables may have a covariance of 0 but are not independent. Only under some additional assumptions (e.g., the data follow multivariate normal distributions) does a covariance of 0 imply independence
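
The warning that zero covariance does not imply independence is easy to see on a small example. In the sketch below (sample covariance with divisor n − 1, an assumption), B is fully determined by A as B = A², yet their covariance is exactly 0 because the relationship is symmetric around the mean of A.

```python
from statistics import mean

def covariance(a, b):
    """Sample covariance: average product of deviations, with divisor n - 1."""
    mean_a, mean_b = mean(a), mean(b)
    total = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    return total / (len(a) - 1)

A = [-2, -1, 0, 1, 2]
B_squared = [x * x for x in A]   # fully dependent on A, but not linearly
B_doubled = [2 * x for x in A]   # linearly dependent on A

cov_squared = covariance(A, B_squared)
cov_doubled = covariance(A, B_doubled)
```

`cov_squared` is 0 although B can be computed exactly from A, while `cov_doubled` is positive, as expected for two attributes that rise together.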
Data Reduction
• Data reduction: obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
• Why data reduction? A database or data warehouse may store terabytes of data, and complex data analysis may take a very long time to run on the complete data set.
• Data reduction strategies
  • Dimensionality reduction, e.g., remove unimportant attributes
    • Wavelet transforms
    • Principal Components Analysis (PCA)
    • Feature subset selection, feature creation
  • Numerosity reduction (some simply call it data reduction)
    • Regression and log-linear models
    • Histograms, clustering, sampling
    • Data cube aggregation
  • Data compression
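
Two of the numerosity reduction strategies listed above, histograms and sampling, can be sketched briefly. The equal-width bucketing and the fixed random seed are illustrative choices and are not prescribed by the slides. The histogram function assumes the values are not all identical.

```python
import random

def equal_width_histogram(values, n_buckets):
    """Summarize values as (low, high, count) triples over equal-width buckets."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets  # assumes hi > lo
    counts = [0] * n_buckets
    for v in values:
        # The maximum value would land one past the end, so clamp it into the last bucket.
        index = min(int((v - lo) / width), n_buckets - 1)
        counts[index] += 1
    return [(lo + i * width, lo + (i + 1) * width, c) for i, c in enumerate(counts)]

def simple_random_sample(values, k, seed=0):
    """Simple random sample without replacement; the seed makes the run repeatable."""
    return random.Random(seed).sample(values, k)

values = list(range(1, 101))          # 100 made-up values
histogram = equal_width_histogram(values, 4)
sample = simple_random_sample(values, 10)
```

The histogram keeps 4 triples and the sample keeps 10 values in place of the 100 originals, which is the kind of volume reduction the strategies aim for.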
Data Reduction: Dimensionality Reduction
• Curse of dimensionality
  • When dimensionality increases, data becomes increasingly sparse
  • Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful
  • The possible combinations of subspaces grow exponentially
• Dimensionality reduction
  • Avoids the curse of dimensionality
  • Helps eliminate irrelevant features and reduce noise
  • Reduces the time and space required in data mining
  • Allows easier visualization
• Dimensionality reduction techniques
  • Wavelet transforms
  • Principal Component Analysis
  • Supervised and nonlinear techniques (e.g., feature selection)
Data Reduction: Principal Component Analysis
[Figure with axes x1 and x2]
• Find a projection that captures the largest amount of variation in the data
• The original data are projected onto a much smaller space, resulting in dimensionality reduction. We find the eigenvectors of the covariance matrix, and these eigenvectors define the new space
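
For two attributes, the eigenvectors of the covariance matrix can be written in closed form, which allows a dependency-free sketch of the idea above. This is a simplified illustration limited to 2-D data. A practical implementation would use a linear algebra library and handle any number of dimensions. The points are made up, and the sketch assumes they are not all identical.

```python
import math

def pca_2d(points):
    """Project 2-D points onto the direction of largest variance.

    Returns (direction, projected, explained), where `explained` is the share
    of total variance captured by that single direction.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the 2x2 sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)

    # Largest eigenvalue of a symmetric 2x2 matrix, in closed form.
    largest = (sxx + syy) / 2 + math.hypot((sxx - syy) / 2, sxy)

    # Matching eigenvector; fall back to an axis if the matrix is already diagonal.
    if sxy != 0:
        vx, vy = sxy, largest - sxx
    elif sxx >= syy:
        vx, vy = 1.0, 0.0
    else:
        vx, vy = 0.0, 1.0
    norm = math.hypot(vx, vy)
    direction = (vx / norm, vy / norm)

    projected = [x * direction[0] + y * direction[1] for x, y in centered]
    explained = largest / (sxx + syy)
    return direction, projected, explained

# Made-up points scattered around the line x2 = x1.
points = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0)]
direction, projected, explained = pca_2d(points)
```

Each point is reduced from two coordinates to one, and `explained` reports how much of the total variance that single coordinate retains.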
Data Reduction: Attribute Subset Selection
• Another way to reduce the dimensionality of data
• Redundant attributes
  • Duplicate much or all of the information contained in one or more other attributes
  • E.g., purchase price of a product and the amount of sales tax paid
• Irrelevant attributes
  • Contain no information that is useful for the data mining task at hand
  • E.g., a student's ID is often irrelevant to the task of predicting the student's GPA
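
One simple way to act on the redundant-attribute case above is a greedy correlation filter: keep an attribute only if it is not strongly correlated with one already kept. This is a sketch of a single heuristic, not a complete attribute subset selection method, and the column names, values, and 0.95 threshold are assumptions. It does not detect irrelevant attributes such as IDs, which need knowledge of the mining task, and it assumes no column is constant.

```python
import math
from statistics import mean

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    mean_a, mean_b = mean(a), mean(b)
    numerator = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    denominator = math.sqrt(
        sum((x - mean_a) ** 2 for x in a) * sum((y - mean_b) ** 2 for y in b)
    )
    return numerator / denominator

def drop_redundant(columns, threshold=0.95):
    """Keep attributes, in order, unless they correlate strongly with a kept one."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical columns: sales tax is a fixed fraction of the purchase price.
columns = {
    "price": [10.0, 20.0, 30.0, 40.0, 50.0],
    "sales_tax": [0.8, 1.6, 2.4, 3.2, 4.0],
    "units_sold": [3.0, 9.0, 2.0, 7.0, 5.0],
}
selected = drop_redundant(columns)
```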
Thank You
