Data Pre-Processing
Submitted by: R. Archana (10ucs05), D. Gayathri (10ucs11)
Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation
Data reduction
Obtains a representation that is reduced in volume but produces the same or similar analytical results
Data discretization
Part of data reduction but with particular importance, especially for numerical data
Data Cleaning
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Missing Data
Data is not always available
E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
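One common way to fill in a missing value is to substitute the mean of the attribute's known values. The slides do not name a specific strategy, so the sketch below is only an illustration under that assumption; the function name `fill_missing_with_mean` and the sample records are hypothetical.

```python
from statistics import mean


def fill_missing_with_mean(rows, attribute):
    """Return copies of `rows` where a None in `attribute` is replaced
    by the mean of the values that were recorded."""
    known = [r[attribute] for r in rows if r[attribute] is not None]
    if not known:
        # Nothing to average over: return the rows unchanged.
        return [dict(r) for r in rows]
    fill = mean(known)
    return [
        {**r, attribute: fill if r[attribute] is None else r[attribute]}
        for r in rows
    ]


# Made-up sales tuples; customer c2 has no recorded income.
sales = [
    {"customer": "c1", "income": 40000},
    {"customer": "c2", "income": None},
    {"customer": "c3", "income": 60000},
]
filled = fill_missing_with_mean(sales, "income")
print(filled[1]["income"])  # 50000
```

The original list is left untouched, so the unfilled data remains available for comparison.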
Clustering
detect and remove outliers
Regression
smooth by fitting the data into regression functions
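Smoothing by regression can be sketched with a simple least-squares line: fit y = slope * x + intercept to the noisy points, then replace each observed y with the fitted value. This is a minimal illustration assuming a single predictor and a linear relationship; the helper names and the sample data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def smooth_by_regression(xs, ys):
    """Replace each y with the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]


# Made-up noisy measurements that roughly follow y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]
smoothed = smooth_by_regression(xs, ys)
```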
Data Integration
Data integration:
combines data from multiple sources into a coherent store
Schema integration
integrate metadata from different sources
Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
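The entity identification example can be sketched as a join between two sources whose key attributes carry different names. The sketch assumes the mapping between `cust-id` (source A) and `cust-#` (source B) is already known from the metadata; the function `integrate`, the unified key `cust_id`, and the records are hypothetical.

```python
def integrate(source_a, source_b, key_a="cust-id", key_b="cust-#"):
    """Combine records that refer to the same customer, even though the
    two sources name the key attribute differently."""
    by_key = {rec[key_b]: rec for rec in source_b}
    merged = []
    for rec in source_a:
        match = by_key.get(rec[key_a])
        if match is None:
            continue  # customer appears only in source A
        combined = {"cust_id": rec[key_a]}  # single, unified key name
        combined.update({k: v for k, v in rec.items() if k != key_a})
        combined.update({k: v for k, v in match.items() if k != key_b})
        merged.append(combined)
    return merged


# Made-up records from two sources.
a = [{"cust-id": 1, "name": "Asha"}, {"cust-id": 2, "name": "Ravi"}]
b = [{"cust-#": 2, "city": "Madurai"}, {"cust-#": 3, "city": "Chennai"}]
result = integrate(a, b)
print(result)  # [{'cust_id': 2, 'name': 'Ravi', 'city': 'Madurai'}]
```

Only customers present in both sources are kept here; a real integration step would also have to decide what to do with unmatched records and conflicting attribute values.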
Data Transformation
Smoothing: remove noise from data
Aggregation: summarization, data cube construction
Generalization: concept hierarchy climbing
Normalization: scaled to fall within a small, specified range
min-max normalization
z-score normalization
normalization by decimal scaling
Data Transformation: Normalization
min-max normalization
$$v' = \frac{v - \min_A}{\max_A - \min_A}\,(\text{new\_max}_A - \text{new\_min}_A) + \text{new\_min}_A$$
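The three normalization methods can be sketched in a few lines each. The min-max function follows the formula on this slide, where min and max are the smallest and largest values of attribute A. The slides do not spell out the other two, so the sketch assumes their standard definitions: z-score as (v - mean) / standard deviation, and decimal scaling as v / 10^j for the smallest integer j that brings every absolute value below 1. Function names and sample values are hypothetical.

```python
from statistics import mean, pstdev


def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("min-max normalization is undefined for constant data")
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def z_score(values):
    """Centre on the mean and divide by the standard deviation.
    Assumption: the population standard deviation is used."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        raise ValueError("z-score normalization is undefined for constant data")
    return [(v - mu) / sigma for v in values]


def decimal_scaling(values):
    """Divide by the smallest power of ten that makes every |v'| < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


# Made-up income values.
incomes = [12000, 73600, 98000]
print(min_max(incomes))              # middle value is about 0.716
print(decimal_scaling([-986, 917]))  # [-0.986, 0.917]
```

Min-max normalization preserves the relationships among the original values but is sensitive to outliers, since a single extreme value stretches the whole range.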
Dimensionality Reduction
Feature selection (i.e., attribute subset selection):
Select a minimum set of features such that the probability distribution of the different classes, given the values of those features, is as close as possible to the original distribution given the values of all features
Reduces the number of attributes appearing in the discovered patterns, making the patterns easier to understand
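The definition above can be illustrated with a brute-force search on a toy data set: try feature subsets from smallest to largest and return the first one whose class distribution, conditioned on the subset, stays within a tolerance of the distribution conditioned on all features. This is only a sketch of the definition, not a practical method (exhaustive search is exponential in the number of features, and heuristic searches are normally used instead). All names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def class_distribution(rows, labels, features):
    """Map each combination of feature values to P(class | values)."""
    counts = defaultdict(Counter)
    for row, label in zip(rows, labels):
        counts[tuple(row[f] for f in features)][label] += 1
    return {
        key: {c: n / sum(ctr.values()) for c, n in ctr.items()}
        for key, ctr in counts.items()
    }


def max_deviation(rows, labels, subset, all_features):
    """Largest gap between P(class | subset) and P(class | all features)."""
    full = class_distribution(rows, labels, all_features)
    reduced = class_distribution(rows, labels, subset)
    worst = 0.0
    for row in rows:
        p_full = full[tuple(row[f] for f in all_features)]
        p_sub = reduced[tuple(row[f] for f in subset)]
        for c in set(labels):
            worst = max(worst, abs(p_full.get(c, 0.0) - p_sub.get(c, 0.0)))
    return worst


def select_min_features(rows, labels, tolerance=0.0):
    """Smallest subset whose class distribution stays within `tolerance`."""
    all_features = sorted(rows[0])
    for size in range(len(all_features) + 1):
        for subset in combinations(all_features, size):
            if max_deviation(rows, labels, subset, all_features) <= tolerance:
                return list(subset)
    return all_features


# Toy data: the class depends only on feature "a".
rows = [{"a": 0, "b": 0}, {"a": 0, "b": 1}, {"a": 1, "b": 0}, {"a": 1, "b": 1}]
labels = ["no", "no", "yes", "yes"]
print(select_min_features(rows, labels))  # ['a']
```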
Discretization
Three types of attributes:
Nominal: values from an unordered set
Ordinal: values from an ordered set
Continuous: real numbers
Discretization:
divide the range of a continuous attribute into intervals
Some classification algorithms only accept categorical attributes
Reduce data size by discretization
Prepare for further analysis
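Dividing a continuous range into intervals can be sketched with equal-width binning. The slides do not say which discretization method is intended, so equal-width binning is an assumption chosen for simplicity; the function name and the sample ages are hypothetical.

```python
def equal_width_bins(values, k):
    """Assign each value the index (0 .. k-1) of the equal-width
    interval it falls into."""
    if k < 1:
        raise ValueError("k must be at least 1")
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)  # constant attribute: a single interval
    width = (hi - lo) / k
    # The maximum value would land in bin k, so clamp it to the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


# Made-up ages split into 3 intervals of width 19 over the range [13, 70].
ages = [13, 22, 35, 47, 58, 70]
bins = equal_width_bins(ages, 3)
print(bins)  # [0, 0, 1, 1, 2, 2]
```

The resulting bin indices form a categorical attribute that can be passed to algorithms that do not accept continuous input.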
Concept hierarchies
reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior).
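The age example can be sketched as a lookup that replaces each numeric value with its higher-level concept. The cut points used here (up to 29 is young, up to 59 is middle-aged) are assumptions for illustration only; the slides do not define the boundaries.

```python
def age_to_concept(age, young_max=29, middle_max=59):
    """Replace a numeric age with a higher-level concept.
    The default boundaries are illustrative assumptions."""
    if age < 0:
        raise ValueError("age cannot be negative")
    if age <= young_max:
        return "young"
    if age <= middle_max:
        return "middle-aged"
    return "senior"


# Made-up ages.
ages = [23, 35, 61, 47, 19]
concepts = [age_to_concept(a) for a in ages]
print(concepts)  # ['young', 'middle-aged', 'senior', 'middle-aged', 'young']
```

Five distinct numeric values collapse into three concepts, which is the data reduction the hierarchy is meant to achieve.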