The document discusses various techniques for data pre-processing which are necessary steps to ensure quality data for data mining and analytics. It covers topics such as data cleaning to handle missing values and outliers, data integration and reduction strategies like dimensionality reduction and discretization, as well as common transformation techniques like normalization. The goal of data pre-processing is to prepare raw data into a format suitable for mining by cleaning, integrating, reducing, and transforming data from multiple sources into an organized and consistent format.

DATA PRE-PROCESSING

Submitted by: R.Archana (10ucs05), D.Gayathri (10ucs11)

Why Data Preprocessing?


Data in the real world is dirty:
- incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
- noisy: containing errors or outliers
- inconsistent: containing discrepancies in codes or names

No quality data, no quality mining results!


- Quality decisions must be based on quality data
- A data warehouse needs consistent integration of quality data

Multi-Dimensional Measure of Data Quality


A well-accepted multidimensional view:
- Accuracy
- Completeness
- Consistency
- Timeliness
- Believability
- Value added
- Interpretability
- Accessibility

Major Tasks in Data Preprocessing


Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies

Data integration
Integration of multiple databases, data cubes, or files

Data transformation
Normalization and aggregation

Data reduction
Obtains a representation of the data that is reduced in volume but produces the same or similar analytical results

Data discretization
Part of data reduction but with particular importance, especially for numerical data

Forms of data preprocessing

Data Cleaning
Data cleaning tasks
- Fill in missing values
- Identify outliers and smooth out noisy data
- Correct inconsistent data

Missing Data
Data is not always available
E.g., many tuples have no recorded value for several attributes, such as customer income in sales data

Missing data may be due to:
- equipment malfunction
- data that was inconsistent with other recorded data and thus deleted
- data not entered due to misunderstanding
- certain data not being considered important at the time of entry
- failure to register history or changes of the data

Missing data may need to be inferred.

How to Handle Missing Data?


- Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
- Fill in the missing value manually: tedious and often infeasible
- Use a global constant to fill in the missing value: e.g., "unknown", a new class?!
- Use the attribute mean to fill in the missing value
- Use the most probable value to fill in the missing value: inference-based methods such as a Bayesian formula or a decision tree
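A minimal sketch of the first three strategies, assuming pandas and an illustrative "income" attribute (the column names and values are not from the slides):

    import pandas as pd

    df = pd.DataFrame({"income": [50000, None, 42000, None, 61000],
                       "age":    [25, 31, None, 45, 38]})

    # Ignore the tuple: drop rows where the attribute of interest is missing
    df_dropped = df.dropna(subset=["income"])

    # Fill with a global constant (here -1 stands in for "unknown")
    df_const = df.fillna({"income": -1})

    # Fill with the attribute mean
    df_mean = df.fillna({"income": df["income"].mean()})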

How to Handle Noisy Data?


Binning method:
- first sort the data and partition it into (equi-depth) bins
- then smooth by bin means, bin medians, bin boundaries, etc.

Clustering
detect and remove outliers

Combined computer and human inspection


detect suspicious values automatically and have a human check them

Regression
smooth by fitting the data to regression functions
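A minimal sketch of the binning method with smoothing by bin means, assuming numpy; the price values and the choice of 3 bins are illustrative:

    import numpy as np

    prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

    # Equi-depth binning: sort, then split into partitions of equal size
    bins = np.array_split(np.sort(prices), 3)

    # Smooth by bin means: replace every value in a bin with the bin's mean
    smoothed = [np.full(len(b), b.mean()) for b in bins]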

Data Integration
Data integration:
combines data from multiple sources into a coherent store

Schema integration
- integrate metadata from different sources
- Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id and B.cust-# may refer to the same attribute

Detecting and resolving data value conflicts


- for the same real-world entity, attribute values from different sources may differ
- possible reasons: different representations, different scales, e.g., metric vs. British units

Handling Redundant Data


- Redundant data often occur when multiple databases are integrated
- The same attribute may have different names in different databases
- Careful integration of data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
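A minimal sketch of integrating two sources whose customer keys have different names (as in A.cust-id vs. B.cust-#), assuming pandas; the table and column names are illustrative, and the correlation check for redundancy is a common technique added here, not taken from the slides:

    import pandas as pd

    a = pd.DataFrame({"cust_id": [1, 2, 3], "income":   [50, 60, 70]})
    b = pd.DataFrame({"cust_no": [1, 2, 3], "spend":    [5, 6, 7],
                      "income_k": [50, 60, 70]})

    # Entity identification: the same customer key under two different names
    merged = a.merge(b, left_on="cust_id", right_on="cust_no")

    # Redundancy check via correlation: values near 1.0 suggest a redundant attribute
    print(merged["income"].corr(merged["income_k"]))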

Data Transformation
- Smoothing: remove noise from data
- Aggregation: summarization, data cube construction
- Generalization: concept hierarchy climbing
- Normalization: scaled to fall within a small, specified range
  - min-max normalization
  - z-score normalization
  - normalization by decimal scaling

Data Transformation: Normalization

- min-max normalization:
  v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A

- z-score normalization:
  v' = (v - mean_A) / stand_dev_A

- normalization by decimal scaling:
  v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
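A minimal sketch of the three normalizations above, assuming numpy; the attribute values and the target range [0, 1] are illustrative:

    import numpy as np

    v = np.array([200.0, 300.0, 400.0, 600.0, 970.0])

    # min-max normalization to [new_min, new_max]
    new_min, new_max = 0.0, 1.0
    v_minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

    # z-score normalization
    v_zscore = (v - v.mean()) / v.std()

    # decimal scaling: divide by 10^j, j the smallest integer with max(|v'|) < 1
    j = int(np.floor(np.log10(np.abs(v).max()))) + 1
    v_decimal = v / (10 ** j)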

Data Reduction Strategies


- A warehouse may store terabytes of data: complex data analysis/mining may take a very long time to run on the complete data set
- Data reduction: obtains a reduced representation of the data set that is much smaller in volume but still produces the same (or almost the same) analytical results

Data reduction strategies:
- Data cube aggregation
- Dimensionality reduction
- Numerosity reduction
- Discretization and concept hierarchy generation

Data Cube Aggregation


The lowest level of a data cube
the aggregated data for an individual entity of interest
e.g., a customer in a phone calling data warehouse.

Multiple levels of aggregation in data cubes


Further reduce the size of data to deal with

Reference appropriate levels


Use the smallest representation which is enough to solve the task
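A minimal sketch of rolling detail data up to higher aggregation levels, assuming pandas; the call table and column names are illustrative stand-ins for the phone-calling example above:

    import pandas as pd

    calls = pd.DataFrame({
        "customer": ["A", "A", "B", "B"],
        "year":     [2023, 2024, 2023, 2024],
        "minutes":  [120, 95, 40, 310],
    })

    # Lowest level of the cube: aggregated data per customer per year
    per_customer_year = calls.groupby(["customer", "year"])["minutes"].sum()

    # Higher level: roll up to per year only, further reducing the data
    per_year = calls.groupby("year")["minutes"].sum()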

Dimensionality Reduction
Feature selection (i.e., attribute subset selection):
- Select a minimum set of features such that the probability distribution of the different classes, given the values of those features, is as close as possible to the original distribution given the values of all features
- Fewer attributes appear in the discovered patterns, which makes the patterns easier to understand
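A minimal sketch of attribute subset selection, assuming scikit-learn's SelectKBest as one possible selector (the slides do not prescribe a method); the synthetic data and the choice of k=2 are illustrative:

    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif

    X = np.random.rand(100, 6)                   # 6 candidate attributes
    y = (X[:, 0] + X[:, 3] > 1).astype(int)      # class depends on attributes 0 and 3

    # Keep the 2 attributes most associated with the class label
    selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
    print(selector.get_support(indices=True))    # indices of the selected attributes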

Discretization
Three types of attributes:
- Nominal: values from an unordered set
- Ordinal: values from an ordered set
- Continuous: real numbers

Discretization:
- divide the range of a continuous attribute into intervals
- some classification algorithms accept only categorical attributes
- reduce data size by discretization
- prepare the data for further analysis

Discretization and Concept Hierarchy


Discretization
reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values.

Concept hierarchies
reduce the data by collecting and replacing low level concepts (such as numeric values for the attribute age) by higher level concepts (such as young, middle-aged, or senior).
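A minimal sketch of discretizing a continuous attribute and climbing its concept hierarchy, assuming pandas; the age values and bin edges (30, 60) are illustrative, chosen to match the young / middle-aged / senior example above:

    import pandas as pd

    ages = pd.Series([23, 35, 47, 58, 64, 71])

    # Discretization: replace raw ages with interval labels
    intervals = pd.cut(ages, bins=[0, 30, 60, 100])

    # Concept hierarchy: replace the intervals with higher-level concepts
    concepts = pd.cut(ages, bins=[0, 30, 60, 100],
                      labels=["young", "middle-aged", "senior"])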
