
Data Pre-Processing Techniques Explained

This document discusses data pre-processing techniques. It covers topics like data cleaning, transformation, reduction, and discretization. Data cleaning involves tasks like handling missing data by filling it in or smoothing noisy data. Data transformation includes normalization and aggregation. Data reduction aims to reduce the volume of data while maintaining similar analytical results, using methods like cube aggregation, attribute selection, and principal component analysis. The document emphasizes that high-quality data preparation is important for data warehousing and mining.


Chapter 2 – Data Pre-Processing

Topic Learning Outcomes

At the end of this topic, you should be able to:

1. Explain the different forms of data pre-processing

2. Apply the different types of data pre-processing appropriately

Contents & Structure

• Data preprocessing

• Data cleaning

• Data transformation

• Data reduction

Data Preprocessing
• Data cleaning

– Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies

• Data integration

– Integration of multiple databases, data cubes, or files

• Data transformation

– Normalization and aggregation

• Data reduction

– Obtains reduced representation in volume but produces the same or similar analytical
results

• Data discretization

– Part of data reduction but with particular importance, especially for numerical data

Forms of Data Preprocessing

Why Data Preprocessing?


• Data in the real world is dirty

– incomplete: lacking attribute values, lacking certain attributes of interest, or containing only
aggregate data

• e.g., occupation=“ ”

– noisy: containing errors or outliers

• e.g., Salary=“-10”

– inconsistent: containing discrepancies in codes or names

• e.g., Age=“42” Birthday=“03/07/1997”

• e.g., rating “1,2,3”, now rating “A, B, C”

• e.g., discrepancy between duplicate records

Data Cleaning
• Importance

– “Data cleaning is one of the three biggest problems in data warehousing”—Ralph Kimball

– “Data cleaning is the number one problem in data warehousing”—DCI survey

• Data cleaning tasks

– Fill in missing values

– Identify outliers and smooth out noisy data

– Correct inconsistent data

– Resolve redundancy caused by data integration


Missing Data

• Data is not always available


– E.g., many tuples have no recorded value for several attributes, such as customer income in
sales data
• Missing data may be due to
– equipment malfunction
– inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding
• Missing data may need to be inferred.

How to Handle Missing Data?

• Ignore the tuple: usually done when class label is missing


• Fill in the missing value manually: tedious and often infeasible
• Fill it in automatically with
– a global constant : e.g., “unknown”, a new class?!
– the attribute mean
– the attribute mean for all samples belonging to the same class
– the most probable value: inference-based such as Bayesian formula or decision tree
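The automatic filling strategies above can be sketched in plain Python. This is a minimal illustration; the records, attribute names, and values below are invented for the example, not taken from the slides.

```python
# Toy records with missing income values (illustrative data only).
rows = [
    {"cls": "yes", "income": 30},
    {"cls": "yes", "income": None},   # missing value
    {"cls": "yes", "income": 50},
    {"cls": "no",  "income": 20},
    {"cls": "no",  "income": None},   # missing value
]

def fill_constant(rows, attr, constant="unknown"):
    """Fill missing values with a global constant (effectively a new 'class')."""
    return [dict(r, **{attr: constant if r[attr] is None else r[attr]}) for r in rows]

def fill_mean(rows, attr):
    """Fill missing values with the attribute mean over all known values."""
    known = [r[attr] for r in rows if r[attr] is not None]
    mean = sum(known) / len(known)
    return [dict(r, **{attr: mean if r[attr] is None else r[attr]}) for r in rows]

def fill_class_mean(rows, attr, label):
    """Fill missing values with the attribute mean of samples in the same class."""
    groups = {}
    for r in rows:
        if r[attr] is not None:
            groups.setdefault(r[label], []).append(r[attr])
    means = {c: sum(v) / len(v) for c, v in groups.items()}
    return [dict(r, **{attr: means[r[label]] if r[attr] is None else r[attr]}) for r in rows]

print([r["income"] for r in fill_mean(rows, "income")])
# the global mean of 30, 50, 20 fills both gaps
print([r["income"] for r in fill_class_mean(rows, "income", "cls")])
# class "yes" mean = 40, class "no" mean = 20
```

Note how class-conditional filling gives different values per class, which usually distorts the distribution less than a single global mean.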

Noisy Data

• Noise: random error or variance in a measured variable


• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming convention
• Other data problems that require data cleaning
– duplicate records
– incomplete data
– inconsistent data

How to Handle Noisy Data?

• Binning
– first sort data and partition into (equal-frequency) bins
– then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.
• Regression
– smooth by fitting the data into regression functions
• Clustering
– detect and remove outliers
• Combined computer and human inspection
– detect suspicious values and check by human (e.g., deal with possible outliers)
Binning Methods for Data Smoothing

• Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

• Partition into equal-frequency (equi-depth) bins:

– Bin 1: 4, 8, 9, 15

– Bin 2: 21, 21, 24, 25

– Bin 3: 26, 28, 29, 34

• Smoothing by bin means:

– Bin 1: 9, 9, 9, 9

– Bin 2: 23, 23, 23, 23

– Bin 3: 29, 29, 29, 29

• Smoothing by bin boundaries:

– Bin 1: 4, 4, 4, 15

– Bin 2: 21, 21, 25, 25

– Bin 3: 26, 26, 26, 34
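The binning example above can be reproduced in a few lines of Python; the helper function names are my own, but the logic follows the slide's steps exactly (sort, split into equal-size bins, then smooth).

```python
# Equal-frequency binning with the two smoothing variants from the example.
def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of equal size."""
    values = sorted(values)
    size = len(values) // n_bins
    return [values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's min and max boundaries."""
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_frequency_bins(prices, 3)
print(smooth_by_means(bins))       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```

Ties (a value equally far from both boundaries) are sent to the lower boundary here; the slides do not specify a tie rule, so that choice is an assumption.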

Cluster Analysis

Regression

Data Transformation

Data Transformation: Normalization
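The body of this slide (a figure) is not reproduced here. As a sketch, two standard normalization techniques are min-max scaling and z-score standardization; the price values below are invented for illustration.

```python
import math

# Two common normalization techniques (illustrative values only).
def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly into the range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Rescale values to zero mean and unit standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

prices = [200, 300, 400, 600, 1000]
print(min_max(prices))  # [0.0, 0.125, 0.25, 0.5, 1.0]
print(z_score(prices))  # result has mean 0 and standard deviation 1
```

Min-max is sensitive to outliers (a single extreme value compresses everything else), while z-score is more robust but produces an unbounded range.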

Data Reduction
Data Reduction Strategies

• Data reduction
– Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results.
• Why data reduction?
– A database/data warehouse may store terabytes of data
– Complex data analysis/mining may take a very long time to run on the complete data set
• Data reduction strategies
– Data cube aggregation
– Attribute subset selection
– Principal Component Analysis (PCA)
Data Cube Aggregation

• The lowest level of a data cube (base cuboid)

– The aggregated data for an individual entity of interest

– E.g., a customer in a phone calling data warehouse

• Multiple levels of aggregation in data cubes

– Further reduce the size of data to deal with

• Reference appropriate levels

– Use the smallest representation which is enough to solve the task

• Queries regarding aggregated information should be answered using the data cube, when possible
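Rolling up from the base cuboid to a coarser level is essentially a group-and-sum operation. A minimal sketch, where the sales figures and level names are invented for illustration:

```python
from collections import defaultdict

# Base cuboid: sales per (year, quarter) for one branch (invented numbers).
base = {
    (2023, "Q1"): 224, (2023, "Q2"): 408, (2023, "Q3"): 350, (2023, "Q4"): 586,
    (2024, "Q1"): 300, (2024, "Q2"): 416, (2024, "Q3"): 380, (2024, "Q4"): 604,
}

def roll_up_to_year(cube):
    """Aggregate the quarter level away, producing yearly totals."""
    yearly = defaultdict(int)
    for (year, _quarter), sales in cube.items():
        yearly[year] += sales
    return dict(yearly)

print(roll_up_to_year(base))  # {2023: 1568, 2024: 1700}
```

A query about annual sales can then be answered from the smaller yearly cuboid instead of scanning the base data, which is the point of referencing the appropriate level.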
A Sample Data Cube

Attribute Subset Selection

• Feature selection (i.e., attribute subset selection):

– Select a minimum set of features such that the probability distribution of different classes
given the values for those features is as close as possible to the original distribution given
the values of all features

– reduces the number of attributes in the discovered patterns, making them easier to understand

• Heuristic methods (due to exponential # of choices):

– Step-wise forward selection

– Step-wise backward elimination

– Combining forward selection and backward elimination

– Decision-tree induction
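Step-wise forward selection can be sketched as a greedy loop: repeatedly add the single feature that most improves a score, and stop when no candidate helps. The nearest-class-centroid scoring function and the toy data below are my own illustration, not from the slides; any subset-quality measure could be plugged in.

```python
def centroid_accuracy(X, y, subset):
    """Score a feature subset: training accuracy of a nearest-class-centroid rule."""
    classes = sorted(set(y))
    centroids = {
        c: [sum(row[f] for row, lab in zip(X, y) if lab == c) / y.count(c)
            for f in subset]
        for c in classes
    }
    correct = 0
    for row, lab in zip(X, y):
        dists = {c: sum((row[f] - m) ** 2 for f, m in zip(subset, centroids[c]))
                 for c in classes}
        if min(dists, key=dists.get) == lab:
            correct += 1
    return correct / len(y)

def forward_selection(X, y, score=centroid_accuracy):
    """Greedily add the feature that improves the score most; stop when none does."""
    selected, best = [], 0.0
    remaining = list(range(len(X[0])))
    while remaining:
        scored = [(score(X, y, selected + [f]), f) for f in remaining]
        top, f = max(scored)
        if top <= best:
            break
        selected.append(f)
        remaining.remove(f)
        best = top
    return selected

# Feature 0 separates the two classes perfectly; feature 1 is noise.
X = [[1, 5], [2, 9], [3, 2], [8, 7], [9, 1], [10, 4]]
y = [0, 0, 0, 1, 1, 1]
print(forward_selection(X, y))  # [0]
```

Backward elimination is the mirror image: start from all features and greedily drop the one whose removal hurts the score least.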
Example of Decision Tree Induction

Dimensionality Reduction: Principal Component Analysis (PCA)


• Given N data vectors in n dimensions, find k ≤ n orthogonal vectors (principal components) that can best be used to represent the data
• Steps
– Normalize input data: Each attribute falls within the same range
– Compute k orthonormal (unit) vectors, i.e., principal components
– Each input data (vector) is a linear combination of the k principal component vectors
– The principal components are sorted in order of decreasing “significance” or strength
– Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance (using the strongest principal components, it is possible to reconstruct a good approximation of the original data)
• Works for numeric data only
• Used when the number of dimensions is large
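For two-dimensional data the steps above can be carried out in closed form, since the eigendecomposition of a symmetric 2×2 covariance matrix has an explicit formula. The sketch below, with invented toy points, is an illustration of the method rather than a general implementation (real use would rely on a linear-algebra library and handle the b = 0 case).

```python
import math

def pca_2d(points):
    """First principal component of 2-D data via the closed-form 2x2 eigenproblem."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    # Sample covariance matrix [[a, b], [b, c]].
    a = sum(x * x for x, _ in centered) / (n - 1)
    b = sum(x * y for x, y in centered) / (n - 1)
    c = sum(y * y for _, y in centered) / (n - 1)
    # Eigenvalues of a symmetric 2x2 matrix.
    m, d = (a + c) / 2, math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    lam1, lam2 = m + d, m - d
    # Eigenvector for the larger eigenvalue (assumes b != 0 in this sketch).
    vx, vy = b, lam1 - a
    norm = math.hypot(vx, vy)
    return (vx / norm, vy / norm), (lam1, lam2)

# Points lying exactly on the line y = 2x: all variance is along one direction.
component, (lam1, lam2) = pca_2d([(1, 2), (2, 4), (3, 6), (4, 8)])
print(component)   # ~ (0.447, 0.894), i.e. (1, 2) / sqrt(5)
print(lam1, lam2)  # second eigenvalue is ~0: one component captures everything
```

Because the second eigenvalue is (numerically) zero, the data can be represented by a single component with no loss, which is the "eliminate the weak components" step in miniature.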

Principal Component Analysis


Review Questions

• Why pre-process data?


• How to handle missing data?
• How to handle noisy data?
• When should you employ data reduction?

Summary / Recap of Main Points

• Data preparation or preprocessing is a big issue for both data warehousing and data mining
• Descriptive data summarization is needed for quality data preprocessing
• Data preparation includes
– Data cleaning and data integration
– Data reduction and feature selection
– Discretization
• Many methods have been developed, but data preprocessing is still an active area of research
