
UNIT-II
Know the Data and Data Preprocessing

Know the Data and Data Preprocessing: Data Objects and attribute types, Basic statistical description of data, Data preprocessing, Data cleaning, Data integration and Data reduction. Main approaches for Dimensionality Reduction: Projection, Manifold Learning, PCA. Insufficient Quantity of Training Data, Nonrepresentative Training Data, Poor-Quality Data, Irrelevant Features, Overfitting the Training Data, Underfitting the Training Data, Stepping Back, Testing and Validating.
Data Object
❑Data sets are made up of data objects.
❑A data object represents an entity.
❑Examples:
❑ sales database: customers, store items, sales
❑ medical database: patients, treatments
❑ university database: students, professors, courses
❑Also called samples, examples, instances, data points, objects, or data tuples.
❑Data objects are described by attributes.
Data Objects and Attributes
❑An attribute is a property, characteristic, or feature of a data object.
❑Examples: eye color of a person, temperature, etc.
❑An attribute is also known as a variable, field, characteristic, or feature.
❑A collection of attributes describes an object.
❑Attribute values are numbers or symbols assigned to an attribute.
❑Database rows → data objects; database columns → attributes.
Attributes
❑ Attribute (or dimension, feature, variable): a data field representing a characteristic or feature of a data object, e.g., customer_ID, name, address.
❑ Distinction between attributes and attribute values:
❑ The same attribute can be mapped to different attribute values.
❑ Example: height can be measured in feet or meters.
❑ Different attributes can be mapped to the same set of values.
❑ Example: attribute values for ID and age are both integers.
❑ However, the properties of the attribute values can differ: ID has no limit, but age has a maximum and minimum value.
Attribute Types

❖NOMINAL (“relating to names”)

❖BINARY (only two categories or states)

❖ORDINAL (Order or Ranking)

❖NUMERIC (Measurable quantity)

❖DISCRETE

❖CONTINUOUS
Attribute Types

❑Categorical (Qualitative)
❑ Nominal and Ordinal attributes are collectively referred to as
categorical or qualitative attributes.

❑Numeric (Quantitative)
❑ Interval and Ratio are collectively referred to as quantitative
or numeric attributes.

❑Discrete vs Continuous attributes


Attribute Types
Nominal: categories, states, or “names of things” (symbols).
◼ Hair_color = {auburn, black, blond, brown, grey, red, white}
◼ Marital status, occupation, ID numbers, zip codes
Binary: a nominal attribute with only 2 states (0 and 1).
◼ Symmetric binary: both outcomes equally important, e.g., gender.
◼ Asymmetric binary: outcomes not equally important, e.g., a medical test (positive vs. negative).
Convention: assign 1 to the more important outcome (e.g., HIV positive).
Ordinal: values have a meaningful order (ranking), but the magnitude between successive values is not known.
◼ Size = {small, medium, large}, grades, army rankings
Numeric: a measurable quantity (integer or real-valued).
Interval-scaled:
◼ Measured on a scale of equal-sized units; values have order.
◼ E.g., temperature in °C or °F, calendar dates.
◼ No true zero-point.
Ratio-scaled:
◼ Inherent zero-point.
◼ We can speak of values as being an order of magnitude larger than the unit of measurement (10 K is twice as high as 5 K).
◼ E.g., temperature in Kelvin, length, counts, monetary quantities.
Attribute Types: Discrete vs. Continuous
Discrete attribute
◼ Has only a finite or countably infinite set of values.
◼ E.g., zip codes, profession, or the set of words in a collection of documents.
◼ Sometimes represented as integer variables.
◼ Note: binary attributes are a special case of discrete attributes.
◼ Binary attributes where only non-zero values are important are called asymmetric binary attributes.
Continuous attribute
◼ Has real numbers as attribute values.
◼ E.g., temperature, height, or weight.
◼ In practice, real values can only be measured and represented using a finite number of digits.
◼ Continuous attributes are typically represented as floating-point variables.
Basic Statistical Description of Data
❑Basic statistical descriptions can be used to identify properties of the data and highlight which data values should be treated as noise or outliers.
❑For data preprocessing tasks, we want to learn about data characteristics regarding both the central tendency and the dispersion of the data.
Measures of central tendency include mean, median, mode, and midrange.
Measures of data dispersion include quartiles, interquartile range (IQR), and variance.
These descriptive statistics are of great help in understanding the distribution of the data.
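To make these measures concrete, here is a small NumPy sketch (the income values are made-up illustration data, not from the slides):

```python
import numpy as np

# Hypothetical income sample (in $1000s)
income = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])

mean = income.mean()
median = np.median(income)
midrange = (income.min() + income.max()) / 2

# Mode: the most frequent value
values, counts = np.unique(income, return_counts=True)
mode = values[counts.argmax()]

q1, q3 = np.percentile(income, [25, 75])  # first and third quartiles
iqr = q3 - q1                             # interquartile range
variance = income.var(ddof=1)             # sample variance

# Common rule of thumb: values beyond 1.5 * IQR from the quartiles are outlier candidates
outliers = income[(income < q1 - 1.5 * iqr) | (income > q3 + 1.5 * iqr)]
print(mean, median, mode, midrange, iqr, variance, outliers)
```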
Symmetric vs. Skewed Data

[Figure: median, mean, and mode of symmetric, positively skewed, and negatively skewed distributions. In positively skewed data the mean lies to the right of the median; in negatively skewed data it lies to the left.]
Dispersion
Dispersion measures the extent to which the items vary from the central value. It is also called spread, scatter, or variability.
Data Preprocessing
Data Preprocessing: An Overview

◼ Data Quality

◼ Major Tasks in Data Preprocessing

Data Cleaning

Data Integration

Data Reduction

Data Quality: Why Preprocess the Data?
Measures for data quality (a multidimensional view):
• Accuracy: correct or wrong, accurate or not
• Completeness: not recorded, unavailable
• Consistency: some entries modified but others not, dangling references, …
• Timeliness: is the data updated in a timely manner?
• Believability: how trustworthy is the data?
• Interpretability: how easily can the data be understood?
Major Tasks in Data Preprocessing
Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
Data integration: integration of multiple databases, data cubes, or files.
Data reduction: dimensionality reduction, numerosity reduction, data compression.
Data transformation and data discretization: normalization, concept hierarchy generation.
[Figure: forms of data preprocessing.]
Data Cleaning
Data in the real world is dirty: there is a lot of potentially incorrect data, e.g., due to instrument faults, human or computer error, or transmission errors.
• Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data; e.g., Occupation=“ ” (missing data).
• Noisy: containing noise, errors, or outliers; e.g., Salary=“−10” (an error).
• Inconsistent: containing discrepancies in codes or names; e.g., Age=“42” but Birthday=“20/03/2010”; a rating that was “1, 2, 3” is now “A, B, C”; discrepancies between duplicate records.
• Intentional (e.g., disguised missing data): Jan. 1 recorded as everyone’s birthday?
Incomplete (Missing) Data
Data is not always available.
• E.g., many tuples have no recorded value for several attributes, such as customer income in sales data.
Missing data may be due to:
• Equipment malfunction
• Inconsistency with other recorded data, leading to deletion
• Data not entered due to misunderstanding
• Certain data not being considered important at the time of entry
• Failure to register the history or changes of the data
Missing data may need to be inferred.
How to Handle Missing Data?
Ignore the tuple: usually done when the class label is missing (when doing classification); not effective when the percentage of missing values per attribute varies considerably.
Fill in the missing value manually: tedious and often infeasible.
Fill it in automatically with:
• A global constant, e.g., “unknown” (which may act as a new class!)
• The attribute mean
• The attribute mean for all samples belonging to the same class: smarter
• The most probable value: inference-based, such as a Bayesian formula or a decision tree
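A minimal pandas sketch of the automatic fill-in strategies listed above (the DataFrame, its column names, and the constant are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing income values
df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [50_000, np.nan, 42_000, np.nan, 48_000],
})

# Strategy 1: fill with a global constant
df["income_const"] = df["income"].fillna(-1)

# Strategy 2: fill with the overall attribute mean
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Strategy 3: fill with the mean of samples in the same class (smarter)
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean")
)
print(df)
```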
Noisy Data
Noise: random error or variance in a measured variable.
Incorrect attribute values may be due to:
• Faulty data collection instruments
• Data entry problems
• Data transmission problems
• Technology limitations
• Inconsistency in naming conventions
Other data problems which require data cleaning:
• Duplicate records
• Incomplete data
• Inconsistent data
How to Handle Noisy Data?
Binning:
• First sort the data and partition it into (equal-frequency) bins.
• Then smooth by bin means, by bin medians, or by bin boundaries, etc. (see the sketch below).
Regression:
• Data smoothing can also be done by fitting regression functions (linear or multiple linear regression).
Clustering:
• Place data elements into groups of similar values (clusters); detect and remove outliers.
Combined computer and human inspection:
• Detect suspicious values automatically and have a human check them (e.g., deal with possible outliers).
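A small sketch of smoothing by equal-frequency bins using pandas (the price values are illustrative):

```python
import pandas as pd

# Illustrative sorted price data
prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Partition into 3 equal-frequency (equal-depth) bins
bins = pd.qcut(prices, q=3, labels=False)

# Smoothing by bin means: replace each value with the mean of its bin
smoothed_mean = prices.groupby(bins).transform("mean")

# Smoothing by bin boundaries: snap each value to the nearer bin edge
lo = prices.groupby(bins).transform("min")
hi = prices.groupby(bins).transform("max")
smoothed_bound = lo.where((prices - lo) <= (hi - prices), hi)

print(pd.DataFrame({"price": prices, "bin": bins,
                    "by_mean": smoothed_mean, "by_boundary": smoothed_bound}))
```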
Data Integration
Data integration: combines data from multiple sources into a coherent store.
Entity identification problem: identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton.
Detecting and resolving data value conflicts: for the same real-world entity, attribute values from different sources may differ. Possible reasons include different representations and different scales, e.g., metric vs. British units.
Redundancy: object identification, derivable data.
Handling Redundancy in Data Integration
Redundant data occur often when integrating multiple databases:
• Object identification: the same attribute or object may have different names in different databases.
• Derivable data: one attribute may be a “derived” attribute in another table, e.g., annual revenue.
Redundant attributes may be detected by correlation analysis and covariance analysis.
Careful integration of data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality.
χ² Correlation Test (Nominal Data)
The χ² (chi-square) test:

    χ² = Σ [ (Observed − Expected)² / Expected ]

The larger the χ² value, the more likely the variables are related.
The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count.
Note that correlation does not imply causality:
◼ The number of hospitals and the number of car thefts in a city are correlated.
◼ Both are causally linked to a third variable: population.
Chi-Square Calculation: An Example

                          Play chess   Not play chess   Sum (row)
Like science fiction       250 (90)      200 (360)         450
Not like science fiction    50 (210)    1000 (840)        1050
Sum (col.)                 300          1200              1500

χ² (chi-square) calculation (numbers in parentheses are the expected counts, calculated from the data distribution in the two categories):

    χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 = 507.93

This large value shows that like_science_fiction and play_chess are correlated in the group.
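The same calculation can be reproduced with SciPy as a quick check; `correction=False` disables Yates' continuity correction so that the result matches the hand calculation above:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: like / don't like science fiction; columns: play / don't play chess
table = np.array([[250, 200],
                  [50, 1000]])

chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3g}")
print(expected)  # [[ 90. 360.], [210. 840.]]
```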
Data Reduction Strategies
Why data reduction? Data reduction is a process that reduces the volume of the original data and represents it in a much smaller volume. It increases storage efficiency and improves performance (complex data analysis may take a very long time to run on the complete data set).
Data reduction strategies:
• Dimensionality reduction (remove unimportant attributes):
  • Wavelet transforms
  • Principal Components Analysis (PCA)
  • Feature subset selection, feature creation
• Numerosity reduction (some simply call it data reduction):
  • Regression and log-linear models
  • Histograms, clustering, sampling
  • Data cube aggregation
• Data compression
Principal Component Analysis (PCA)
Principal Component Analysis is an unsupervised learning algorithm used for dimensionality reduction in machine learning.
It is a statistical process that converts the observations of correlated features into a set of linearly uncorrelated features with the help of an orthogonal transformation. These new transformed features are called the principal components.
It is one of the popular tools for exploratory data analysis and predictive modeling. It draws strong patterns out of a dataset by keeping the directions of highest variance and discarding the low-variance directions.
PCA tries to find a lower-dimensional surface onto which to project the high-dimensional data.

[Figure: 2-D data on axes x1 and x2, with the first principal component along the direction of greatest variance.]
Steps of PCA (a NumPy sketch of all six steps follows):
Step 1: Standardize the dataset.
Step 2: Calculate the covariance matrix for the features in the dataset.
Step 3: Calculate the eigenvalues and eigenvectors of the covariance matrix.
Step 4: Sort the eigenvalues and their corresponding eigenvectors.
Step 5: Pick the top k eigenvalues and form a matrix of their eigenvectors.
Step 6: Transform the original matrix.
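A minimal NumPy sketch of the six steps (illustrative only; `X` is a randomly generated data matrix with rows as samples):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # hypothetical data: 100 samples, 5 features
k = 2                                  # number of principal components to keep

# Step 1: standardize the dataset
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the features
cov = np.cov(X_std, rowvar=False)

# Step 3: eigenvalues and eigenvectors (eigh suits symmetric matrices)
eigvals, eigvecs = np.linalg.eigh(cov)

# Step 4: sort eigenvalues (and eigenvectors) in descending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 5: pick the top-k eigenvectors
W = eigvecs[:, :k]

# Step 6: project the (standardized) data onto the principal components
X_pca = X_std @ W                      # shape (100, k)
```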
Regression Analysis
◼ Regression analysis is a statistical method to model the relationship between a dependent (target) variable and one or more independent (predictor) variables.
◼ More specifically, regression analysis helps us understand how the value of the dependent variable changes with respect to one independent variable when the other independent variables are held fixed.
◼ It predicts continuous/real values such as temperature, age, salary, house price, etc.
◼ The parameters are estimated so as to give a "best fit" of the data.
Regression is used for prediction (including forecasting of time-series data), inference, hypothesis testing, and modeling of causal relationships.

[Figure: scatter plot with regression line y = x + 1; the observed point (X1, Y1) has fitted value Y1′ on the line.]
Regression Analysis and Log-Linear Models
Linear regression: Y = wX + b
◼ The two regression coefficients, w and b, specify the line and are estimated from the data at hand.
◼ They are fitted using the least-squares criterion on the known values of Y1, Y2, … and X1, X2, … (a short least-squares sketch follows).
Multiple regression: Y = b0 + b1 X1 + b2 X2
◼ Many nonlinear functions can be transformed into the above.
Log-linear models:
◼ Approximate discrete multidimensional probability distributions.
◼ Estimate the probability of each point (tuple) in a multi-dimensional space for a set of discretized attributes, based on a smaller subset of dimensional combinations.
◼ Useful for dimensionality reduction and data smoothing.
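A short sketch of fitting Y = wX + b by least squares on synthetic data (the true line and noise level are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)  # noisy samples of y = 2x + 1

# np.polyfit solves the least-squares problem for a degree-1 polynomial
w, b = np.polyfit(x, y, deg=1)
y_fit = w * x + b  # fitted values on the regression line
print(f"estimated w = {w:.2f}, b = {b:.2f}")  # should be close to 2 and 1
```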
Histogram Analysis

[Figure: histogram of prices; x-axis buckets from 10,000 to 100,000; y-axis counts from 0 to 40.]

Divide the data into buckets and store the average (sum) for each bucket.
Partitioning rules:
◼ Equal-width: equal bucket range.
◼ Equal-frequency (or equal-depth): each bucket contains roughly the same number of samples.
Clustering
Partition the data set into clusters based on similarity, and store only the cluster representations (e.g., centroid and diameter).
Can be very effective if the data is clustered, but not if the data is “smeared”.
Clustering can be hierarchical and can be stored in multi-dimensional index tree structures.
There are many choices of clustering definitions and clustering algorithms.
Sampling
Sampling: obtaining a small sample s to represent the whole data set N.
It allows a mining algorithm to run with complexity that is potentially sub-linear in the size of the data.
Key principle: choose a representative subset of the data.
◼ Simple random sampling may have very poor performance in the presence of skew.
◼ Adaptive sampling methods, e.g., stratified sampling, address this.
Note: sampling may not reduce database I/Os (data is read a page at a time).
Types of Sampling
• Simple random sampling: there is an equal probability of selecting any particular item.
• Sampling with replacement: once an object is selected, it is not removed from the population, so it may be drawn again.
• Sampling without replacement: a selected object is removed from the population.
• Stratified sampling: partition the data set and draw samples from each partition (proportionally, i.e., approximately the same percentage of the data); used in conjunction with skewed data. A sketch of these schemes follows.
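A small sketch of these sampling schemes in pandas/NumPy (the data set and the `stratum` column are hypothetical):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
data = pd.DataFrame({
    "value":   rng.normal(size=1000),
    "stratum": rng.choice(["young", "middle", "senior"], size=1000, p=[0.5, 0.3, 0.2]),
})

# Simple random sampling without replacement (each row drawn at most once)
srswor = data.sample(n=100, replace=False, random_state=0)

# Simple random sampling with replacement (rows may repeat)
srswr = data.sample(n=100, replace=True, random_state=0)

# Stratified sampling: draw ~10% from each stratum
stratified = data.groupby("stratum", group_keys=False).sample(frac=0.1, random_state=0)
print(stratified["stratum"].value_counts())
```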
Sampling: With or Without Replacement

[Figure: raw data with samples drawn with and without replacement.]

Sampling: Cluster or Stratified Sampling

[Figure: raw data on the left and the corresponding cluster/stratified sample on the right.]
What Is Wavelet Transform?
Decomposes a signal into different frequency subbands.
◼ Applicable to n-dimensional signals.
Data are transformed to preserve the relative distance between objects at different levels of resolution.
Allows natural clusters to become more distinguishable.
Used for image compression.

Wavelet Transformation
Discrete wavelet transform (DWT) for linear signal processing and multi-resolution analysis (e.g., Haar-2 and Daubechies-4 wavelets).
Compressed approximation: store only a small fraction of the strongest wavelet coefficients.
Similar to the discrete Fourier transform (DFT), but gives better lossy compression and is localized in space.
Method (a from-scratch sketch follows):
▪ The length, L, must be an integer power of 2 (pad with 0s when necessary).
▪ Each transform has 2 functions: smoothing and difference.
▪ It applies to pairs of data, resulting in two sets of data of length L/2.
▪ It applies the two functions recursively until reaching the desired length.
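As an illustration of the smoothing/difference method above, here is a from-scratch one-level Haar transform in NumPy (a sketch of the averaging variant, not a full multi-resolution pipeline):

```python
import numpy as np

def haar_dwt(signal):
    """One level of a Haar-style transform: pairwise smoothing and difference.

    Returns (approximation, detail), each half the input length.
    The input length must be even (pad with zeros beforehand if needed).
    """
    x = np.asarray(signal, dtype=float)
    pairs = x.reshape(-1, 2)
    approx = pairs.mean(axis=1)               # smoothing: pairwise averages
    detail = (pairs[:, 0] - pairs[:, 1]) / 2  # difference coefficients
    return approx, detail

x = np.array([2, 4, 6, 8, 10, 12, 14, 16], dtype=float)  # length 8 = 2^3
a1, d1 = haar_dwt(x)   # length-4 approximation and detail
a2, d2 = haar_dwt(a1)  # apply recursively for a coarser resolution
print(a1, d1, a2, d2)
```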
Why Wavelet Transform?
• Uses hat-shaped filters: emphasizes regions where points cluster and suppresses weaker information at their boundaries.
• Effective removal of outliers: insensitive to noise and to input order.
• Multi-resolution: detects arbitrarily shaped clusters at different scales.
• Efficient: complexity O(N).
• Limitation: only applicable to low-dimensional data.
Data Transformation
A function that maps the entire set of values of a given attribute to a new set of replacement values, i.e., each old value can be identified with one of the new values.
Data transformation is a process of converting data from one format or structure into another format or structure.
Methods:
• Smoothing: remove noise from data.
• Attribute/feature construction: new attributes constructed from the given ones.
• Aggregation: summarization, data cube construction.
• Normalization: scale values to fall within a smaller, specified range (min-max normalization, z-score normalization, normalization by decimal scaling).
• Discretization: concept hierarchy climbing.
Normalization
Min-max normalization: to [new_minA, new_maxA]

    v′ = ((v − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA

◼ Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to
    (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716

Z-score normalization (μA: mean, σA: standard deviation):

    v′ = (v − μA) / σA

◼ Ex. Let μ = 54,000 and σ = 16,000. Then
    (73,600 − 54,000) / 16,000 = 1.225

Normalization by decimal scaling:

    v′ = v / 10^j,  where j is the smallest integer such that max(|v′|) < 1
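The three normalization formulas in a small NumPy sketch (the income array combines the slide's example values with made-up companions):

```python
import numpy as np

# The slide's example income values plus made-up companions
v = np.array([12_000, 54_000, 73_600, 98_000], dtype=float)

# Min-max normalization to [0.0, 1.0]
new_min, new_max = 0.0, 1.0
v_minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min
# 73,600 maps to 0.716, matching the slide

# Z-score normalization (the slide's example plugs in given mu = 54,000, sigma = 16,000)
mu, sigma = 54_000.0, 16_000.0
v_zscore = (v - mu) / sigma  # 73,600 maps to 1.225

# Decimal scaling: divide by 10^j, j the smallest integer with max(|v'|) < 1
j = int(np.floor(np.log10(np.abs(v).max()))) + 1
v_decimal = v / 10**j  # here j = 5, so all values fall in (-1, 1)
```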
Discretization
Three types of attributes:
• Nominal: values from an unordered set, e.g., color, profession.
• Ordinal: values from an ordered set, e.g., military or academic rank.
• Numeric: real numbers, e.g., integer or real values.
Discretization: divide the range of a continuous attribute into intervals.
• Interval labels can then be used to replace the actual data values.
• Reduces data size.
• Prepares the data for further analysis, e.g., classification.
• Methods can be supervised vs. unsupervised, and split (top-down) vs. merge (bottom-up).
• Discretization can be performed recursively on an attribute.
Data Discretization Methods
All of the following methods can be applied recursively:
• Binning: unsupervised, top-down split.
• Histogram analysis: unsupervised, top-down split.
• Clustering analysis: unsupervised, top-down split or bottom-up merge.
• Decision-tree analysis: supervised, top-down split.
• Correlation (e.g., χ²) analysis: unsupervised, bottom-up merge.
Simple Discretization: Binning
Equal-width (distance) partitioning:
• Divides the range into N intervals of equal size: a uniform grid.
• If A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B − A)/N.
• The most straightforward approach, but outliers may dominate the presentation, and skewed data is not handled well.
Equal-depth (frequency) partitioning:
• Divides the range into N intervals, each containing approximately the same number of samples.
• Gives good data scaling.
• Managing categorical attributes can be tricky.
[Figure: equal-width vs. equal-depth binning of the same data; see the pandas sketch below.]
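Both partitioning rules are one-liners in pandas (the price values are illustrative):

```python
import pandas as pd

prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-width: 3 intervals of equal size, W = (34 - 4) / 3 = 10
equal_width = pd.cut(prices, bins=3)

# Equal-depth: 3 intervals, each holding roughly the same number of samples
equal_depth = pd.qcut(prices, q=3)

print(pd.DataFrame({"price": prices,
                    "equal_width": equal_width,
                    "equal_depth": equal_depth}))
```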
Challenges of ML
• Insufficient quantity of training data
• Nonrepresentative training data
• Poor-quality data
• Irrelevant features
• Overfitting the training data
• Underfitting the training data
• Data mismatch
• Hyperparameter tuning and model selection
• Stepping back
• Testing and validating

BY PUNNA RAO
