
Data Preprocessing

WHAT IS DATA?

● Data is a collection of data objects and their attributes

● An attribute is a property or characteristic of an object
– Examples: eye color of a person, temperature, etc.
– An attribute is also known as a variable, field, characteristic, or feature

● A collection of attributes describes an object
– An object is also known as a record, point, case, sample, entity, or instance

[Figure: a table of data in which rows are objects and columns are attributes]
Data Types
• Data types are important in statistics in order to apply statistical measures correctly.

• They also need to be known in order to carry out exploratory data analysis (EDA).

Qualitative Data
• Qualitative data is information that cannot be measured in the form of numbers. It is also known as categorical data. It normally comprises words and narratives, and its values are labeled with names.

• It delivers information about the qualities of things in the data. The outcome of qualitative data analysis can take the form of highlighted key words, extracted information, and elaborated ideas.

For example:
Hair colour: black, brown, red
Opinion: agree, disagree, neutral
Quantitative Data
• Quantitative data is information gathered from a group of individuals that lends itself to statistical data analysis. Numerical data is another name for quantitative data.

• Simply put, it gives information about quantities of items in the data, and those quantities can be measured and formulated in terms of numbers.

For example:
We can measure height (1.70 meters) or distance (1.35 miles) with the help of a ruler or tape.
We can measure water (1.5 litres) with a jug.

TYPES OF ATTRIBUTES
● There are different types of attributes
– Nominal
 Examples: ID numbers, eye color, zip codes

– Ordinal
 Examples: rankings (e.g., taste of potato chips on
a scale from 1-10), grades, height in {tall,
medium, short}

DISCRETE AND CONTINUOUS ATTRIBUTES
● Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a collection of
documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete attributes
● Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and represented using a
finite number of digits.
– Continuous attributes are typically represented as floating- point
variables.
– Interval
 Examples: calendar dates, temperatures in Celsius or Fahrenheit.
– Ratio
 Examples: temperature in Kelvin, length, time, counts
Why Data Preprocessing?
• Data in the real world is dirty
– incomplete: lacking attribute values, lacking certain attributes
of interest, or containing only aggregate data
– noisy: containing errors or outliers
– inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!
– Quality decisions must be based on quality data
– Data warehouse needs consistent integration of quality data
• A multi-dimensional measure of data quality:
– A well-accepted multi-dimensional view:
• accuracy, completeness, consistency, timeliness, believability, value
added, interpretability, accessibility
– Broad categories:
• intrinsic, contextual, representational, and accessibility.
Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, files, or notes
• Data transformation
– Normalization (scaling to a specific range)
– Aggregation
• Data reduction
– Obtains reduced representation in volume but produces the same or similar
analytical results
– Data discretization: with particular importance, especially for numerical data
– Data aggregation, dimensionality reduction, data compression, generalization
Forms of data preprocessing
Data Cleaning

• Data cleaning tasks


– Fill in missing values
– Identify outliers and smooth out noisy data
– Correct inconsistent data
Missing Data
• Data is not always available
– E.g., many tuples have no recorded value for several attributes, such
as customer income in sales data
• Missing data may be due to
– equipment malfunction
– inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding
– certain data may not be considered important at the time of entry
– not register history or changes of the data
• Missing data may need to be inferred
How to Handle Missing Data?
• Ignore the tuple: usually done when class label is missing (assuming the
task is classification—not effective in certain cases)

• Fill in the missing value manually: tedious + infeasible?


• Use a global constant to fill in the missing value: e.g.,
“unknown”, a new class?!

• Use the attribute mean to fill in the missing value


• Use the attribute mean for all samples of the same class to fill
in the missing value: smarter
• Use the most probable value to fill in the missing value:
inference-based such as regression, Bayesian formula, decision tree
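As an illustration (not from the slides), here is a minimal pandas sketch of several of these strategies, assuming a hypothetical dataframe with an "income" attribute and a "class" label column:

import numpy as np
import pandas as pd

# Hypothetical data: 'income' has missing values, 'class' is the class label
df = pd.DataFrame({"income": [50, np.nan, 70, np.nan, 40],
                   "class":  ["A", "A", "B", "B", "A"]})

# Ignore the tuple: drop rows whose attribute (or label) is missing
dropped = df.dropna(subset=["income"])

# Fill with a global constant (here a sentinel value standing in for "unknown")
constant_filled = df["income"].fillna(-1)

# Fill with the attribute mean over all samples
mean_filled = df["income"].fillna(df["income"].mean())

# Fill with the attribute mean of samples in the same class (smarter)
class_mean_filled = df["income"].fillna(
    df.groupby("class")["income"].transform("mean"))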
Noisy Data
• Q: What is noise?
• A: Random error in a measured variable.
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming convention
• Other data problems which require data cleaning
– duplicate records
– incomplete data
– inconsistent data
NOISE
● Noise refers to modification of original values
– Examples: distortion of a person’s voice when talking
on a poor phone and “snow” on television screen

[Figure: two sine waves, and the same two sine waves with added noise]


How to Handle Noisy Data?
• Binning method:
– first sort data and partition into (equi-depth) bins
– then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
– used also for discretization
• Clustering
– detect and remove outliers
• Semi-automated method: combined computer and
human inspection
– detect suspicious values and check manually
• Regression
– smooth by fitting the data into regression functions
Simple Discretization Methods: Binning
• Equal-width (distance) partitioning:
– It divides the range into N intervals of equal size: uniform
grid
– if A and B are the lowest and highest values of the attribute,
the width of intervals will be: W = (B-A)/N.
– The most straightforward
– But outliers may dominate presentation
– Skewed data is not handled well.
• Equal-depth (frequency) partitioning:
– It divides the range into N intervals, each containing
approximately the same number of samples (see the sketch below)
– Good data scaling
– Managing categorical attributes can be tricky.
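As a hedged illustration (not part of the slides), pandas offers both partitioning schemes directly: pd.cut for equal-width bins and pd.qcut for equal-depth (equal-frequency) bins. The price values reuse the example on the next slide.

import pandas as pd

prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-width partitioning: 3 intervals of width W = (B - A) / N = 10
equal_width = pd.cut(prices, bins=3)

# Equal-depth (frequency) partitioning: 3 bins with roughly the same number of samples
equal_depth = pd.qcut(prices, q=3)

print(equal_width.value_counts())
print(equal_depth.value_counts())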
Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25,
26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
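A small NumPy sketch that reproduces the equi-depth bins and both smoothing results above (bin means are rounded to the values shown on the slide):

import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])  # already sorted
bins = prices.reshape(3, 4)                 # 3 equi-depth bins of 4 values each

# Smoothing by bin means: replace every value with its bin's (rounded) mean
bin_means = bins.mean(axis=1, keepdims=True)            # 9.0, 22.75, 29.25
smoothed_by_means = np.repeat(np.round(bin_means), 4, axis=1)

# Smoothing by bin boundaries: replace every value with the closer boundary
lo, hi = bins[:, [0]], bins[:, [-1]]
smoothed_by_bounds = np.where(bins - lo <= hi - bins, lo, hi)

print(smoothed_by_means)    # rows: 9 9 9 9 / 23 23 23 23 / 29 29 29 29
print(smoothed_by_bounds)   # rows: 4 4 4 15 / 21 21 25 25 / 26 26 26 34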
OUTLIERS
● Outliers are data objects with characteristics that are
considerably different than most of the other data
objects in the data set

Cluster Analysis
DEVIATION/ANOMALY DETECTION

 Outliers are useful when we need to detect significant


deviations from normal behavior
 Applications:

⚫ Credit Card Fraud Detection

⚫ Network Intrusion Detection

Regression
[Figure: data points in the x-y plane with a fitted regression line y = x + 1; a noisy value Y1 at X1 is replaced by the fitted value Y1' on the line]

• Linear regression (best line to fit two variables)
• Multiple linear regression (more than two variables, fit to a multidimensional surface)
How to Handle Inconsistent Data?
• Manual correction using external references
• Semi-automatic using various tools
– To detect violation of known functional
dependencies and data constraints
– To correct redundant data
Data Integration
• Data integration:
– combines data from multiple sources into a coherent store
• Schema integration
– integrate metadata from different sources
– Entity identification problem: identify real world entities
from multiple data sources, e.g., A.cust-id ≡ B.cust-#
• Detecting and resolving data value conflicts
– for the same real world entity, attribute values from different
sources are different
– possible reasons: different representations, different scales,
e.g., metric vs. British units, different currency
Handling Redundant Data in
Data Integration
• Redundant data occur often when integrating multiple DBs
– The same attribute may have different names in different databases
– One attribute may be a “derived” attribute in another table, e.g.,
annual revenue
• Redundant data may be detected by correlation analysis:

    r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n - 1)\,\sigma_A\,\sigma_B}
• Careful integration can help reduce/avoid redundancies and
inconsistencies and improve mining speed and quality
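As a brief illustration (the column names are hypothetical), pandas can compute the correlation coefficient above directly:

import pandas as pd

# Hypothetical integrated table with two possibly redundant attributes
df = pd.DataFrame({"annual_revenue":  [100, 200, 300, 400],
                   "monthly_revenue": [8.4, 16.7, 25.1, 33.2]})

# Pearson correlation r_{A,B}; values near +1 or -1 suggest one attribute
# is (nearly) derivable from the other and may be a candidate for removal
r = df["annual_revenue"].corr(df["monthly_revenue"])
print(r)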
Data Transformation
• Smoothing: remove noise from data (binning,
clustering, regression)
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: scaled to fall within a small,
specified range
– min-max normalization
– z-score normalization
– normalization by decimal scaling
• Attribute/feature construction
– New attributes constructed from the given ones
Data Transformation: Normalization
Particularly useful for classification (neural networks, distance measures, nearest-neighbor classification, etc.)

• min-max normalization

    v' = \frac{v - \min_A}{\max_A - \min_A}\,(new\_max_A - new\_min_A) + new\_min_A

• z-score normalization

    v' = \frac{v - mean_A}{stand\_dev_A}

• normalization by decimal scaling

    v' = \frac{v}{10^{j}}, where j is the smallest integer such that \max(|v'|) < 1
Normalization: Examples
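Since the slide's worked examples are not reproduced here, the following is a minimal NumPy sketch of the three normalizations on an illustrative array:

import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# min-max normalization to the new range [0, 1]
new_min, new_max = 0.0, 1.0
v_minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# z-score normalization
v_zscore = (v - v.mean()) / v.std()

# decimal scaling: divide by 10^j with the smallest j such that max(|v'|) < 1
j = 0
while np.abs(v / 10 ** j).max() >= 1:
    j += 1
v_decimal = v / 10 ** j                      # here j = 4, since max(|v|) = 1000

print(v_minmax, v_zscore, v_decimal, sep="\n")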
Data Reduction
• Problem:
Data Warehouse may store terabytes of data:
Complex data analysis/mining may take a very
long time to run on the complete data set

• Solution?
– Data reduction…
Data Reduction
•Obtains a reduced representation of the data
set that is much smaller in volume but yet
produces the same (or almost the same)
analytical results
•Data reduction strategies
–Dimensionality reduction
–Data compression
Dimensionality Reduction
• Problem: Feature selection (i.e., attribute subset selection):
– Select a minimum set of features such that the probability
distribution of different classes given the values for those features
is as close as possible to the original distribution given the values
of all features
– Nice side-effect: reduces # of attributes in the discovered patterns
(which are now easier to understand)
• Solution: Heuristic methods (due to exponential # of
choices) usually greedy:
– step-wise forward selection
– step-wise backward elimination
– combining forward selection and backward elimination
– decision-tree induction
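As an illustration (not part of the slides), scikit-learn's SequentialFeatureSelector implements greedy step-wise forward selection and backward elimination; the dataset and estimator below are only placeholders:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Greedy step-wise forward selection of 2 features
# (direction="backward" gives step-wise backward elimination instead)
selector = SequentialFeatureSelector(
    DecisionTreeClassifier(random_state=0),
    n_features_to_select=2,
    direction="forward")
selector.fit(X, y)

print(selector.get_support())   # boolean mask marking the selected attributes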
Example of Decision Tree Induction
nonleaf nodes: tests
branches: outcomes of tests
leaf nodes: class prediction
Initial attribute set: {A1, A2, A3, A4, A5, A6}

[Figure: decision tree with root test A4?, internal tests A1? and A6?, and leaves predicting Class 1 or Class 2]

=> Reduced attribute set: {A1, A4, A6}

Data Compression
• String compression
– There are extensive theories and well-tuned algorithms
– Typically lossless
– But only limited manipulation is possible without expansion
• Audio/video, image compression
– Typically lossy compression, with progressive refinement
– Sometimes small fragments of signal can be reconstructed
without reconstructing the whole
Data Compression

[Figure: lossless compression maps the original data to compressed data and back exactly, while lossy compression recovers only an approximation of the original data]
Principal Component Analysis (PCA)

• Given N data vectors in k dimensions, find c ≤ k orthogonal
vectors that can best be used to represent the data
– The original data set is reduced (projected) to one
consisting of N data vectors on c principal components
(reduced dimensions)
• Each data vector is a linear combination of the c
principal component vectors
• Works for ordered and unordered attributes
• Used when the number of dimensions is large
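A minimal scikit-learn sketch of this projection; the random data and the choice of c = 2 components are illustrative:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))           # N = 100 data vectors in k = 5 dimensions

pca = PCA(n_components=2)               # keep c = 2 principal components
X_reduced = pca.fit_transform(X)        # the N vectors projected onto the 2 components

print(X_reduced.shape)                  # (100, 2)
print(pca.explained_variance_ratio_)    # variance captured by each component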
Clustering
• Partition data set into clusters, and store cluster representation only
• Quality of clusters measured by their diameter (max distance
between any two objects in the cluster) or centroid distance (avg.
distance of each cluster object from its centroid)
• Can be very effective if data is clustered but not if data is “smeared”
• Can have hierarchical clustering (possibly stored in multi-
dimensional index tree structures (B+-tree, R-tree, quad-tree, etc))
• There are many choices of clustering definitions and clustering
algorithms
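A hedged sketch of reducing a dataset to its cluster representation with scikit-learn's k-means; the data and the number of clusters are illustrative:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 3))             # original data

kmeans = KMeans(n_clusters=50, n_init=10, random_state=0).fit(X)

# Store only the cluster representation: 50 centroids instead of 10,000 points
centroids = kmeans.cluster_centers_
labels = kmeans.labels_                      # which centroid represents each point
print(centroids.shape)                       # (50, 3)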
Sampling
• Allow a mining algorithm to run in complexity that is
potentially sub-linear to the size of the data
• Cost of sampling: proportional to the size of the sample,
increases linearly with the number of dimensions
• Choose a representative subset of the data
– Simple random sampling may have very poor performance in the
presence of skew
• Develop adaptive sampling methods
– Stratified sampling:
• Approximate the percentage of each class (or subpopulation of
interest) in the overall database
• Used in conjunction with skewed data
• Sampling may not reduce database I/Os (page at a time).
• Sampling: natural choice for progressive refinement of a
reduced data set.
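As a small illustration (the "class" column is hypothetical), simple random sampling and stratified sampling can be sketched with pandas:

import pandas as pd

df = pd.DataFrame({"value": range(1000),
                   "class": ["A"] * 900 + ["B"] * 100})    # skewed classes

# Simple random sampling without replacement (may under-represent class B)
srs = df.sample(frac=0.1, random_state=0)

# Stratified sampling: approximate the class percentages of the overall data
stratified = df.groupby("class", group_keys=False).apply(
    lambda g: g.sample(frac=0.1, random_state=0))

print(srs["class"].value_counts())
print(stratified["class"].value_counts())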
Sampling

[Figure: simple random sampling without replacement (SRSWOR) and simple random sampling with replacement (SRSWR) applied to the raw data]
Sampling

[Figure: raw data compared with a cluster/stratified sample]
SAMPLE SIZE

[Figure: the same point set sampled at 8000 points, 2000 points, and 500 points]
Splitting Datasets
• To use a dataset in Machine Learning, the dataset is
first split into a training and test set.

• The training set is used to train the model.

• The test set is used to test the accuracy of the model.

• Typically, split 80% training, 20% test.


It’s About Training
Machine Learning is about using data to train a
model
[Figure: the dataset is split into training data and test data; the training data is used to train and produce the model; the test data is then used to test the model and determine its accuracy]
Serial Splitting of the Dataset
• The simplest method of splitting the data is to split it serially.

• Take the first 80% of the rows and put them into the training set.

• Take the remaining 20% of the rows and put them into the test set.

import pandas as pd                           # pandas library

dataset = pd.read_csv("Data.csv")             # read in the data as a pandas dataframe
nrows = dataset.shape[0]                      # shape[0] is the number of rows

# First 80% of the rows, all columns -> training set
train = dataset.iloc[:int(nrows * 0.8), :]

# Remaining 20% of the rows, all columns -> test set
test = dataset.iloc[int(nrows * 0.8):, :]
Random Splitting of the Dataset
• Another method is to pick rows at random.
• Scikit-learn has a built-in method for this.

from sklearn.model_selection import train_test_split   # older versions used sklearn.cross_validation

# Assume the label is the last column in the dataset
X = dataset.iloc[:, :-1]      # X is all the features (exclude the last column)
y = dataset.iloc[:, -1]       # y is the label (the last column)

# Split the data, with 80% train and 20% test;
# test_size sets the split size, random_state seeds the random number generator
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
Data Imbalance - Overfitting
• If the training data is heavily imbalanced, then the model will predict a non-meaningful result.

• For example, if the model is a binary classifier (e.g., apple vs. pear) and nearly all the samples have the same label (e.g., apple), then the model will simply learn that everything is that label (apple).

• This is called overfitting. To prevent overfitting, there needs to be a fairly equal distribution of training samples for each class, or for each range if the label is a real value.
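A quick sketch (the column names are hypothetical) of checking the class distribution and balancing it by oversampling the minority class; this is one possible remedy, not the only one:

import pandas as pd

train = pd.DataFrame({"feature": range(100),
                      "label": ["apple"] * 90 + ["pear"] * 10})

print(train["label"].value_counts())          # reveals the 90/10 imbalance

# Oversample the minority class (with replacement) to match the majority
counts = train["label"].value_counts()
minority = train[train["label"] == counts.idxmin()]
extra = minority.sample(n=counts.max() - counts.min(),
                        replace=True, random_state=0)
balanced = pd.concat([train, extra]).sample(frac=1, random_state=0)   # shuffle

print(balanced["label"].value_counts())       # roughly equal class distribution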
Data Imbalance - Overfitting
Nearly all samples are an apple.
In an imbalance, the model will fit itself to the imbalance, not the predictor.

[Figure: training data that is 90% apple and 10% pear produces a model that will most likely predict a pear test sample as an apple]
Cross Validation
Repeat the process until the model converges to the desired accuracy.

[Figure: the dataset is split into training and test data; the training data is further split at random into training and validation data; the training data is used to train the model, the validation data is used to predict the model's accuracy, and the training method is adjusted to improve accuracy before repeating]
K-Fold Cross Validation
• K-Fold is a well-known form of cross validation.

• Steps:
1. Partition the dataset into k equal-sized partitions.
2. Select one partition as the validation data.
3. Use the remaining k-1 partitions as the training data.
4. Train the model and determine its accuracy on the validation data.
5. Repeat the process k times, selecting a different partition each time as the validation data.
6. Average the accuracy results.
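A compact scikit-learn sketch of these steps; the classifier and k = 5 are illustrative, and cross_val_score carries out the k train/validate rounds and returns the per-fold accuracies:

from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)    # k equal-sized partitions
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=kfold)

print(scores)           # accuracy on each of the k validation partitions
print(scores.mean())    # average the accuracy results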
k-fold cross validation

[Figure: the training data is broken into k equal-sized parts (splits); for each split, the model is trained on the other k-1 parts and evaluated on the held-out split, producing one score per split (score 1, score 2, score 3, ...), and the scores are averaged]
