Data Preprocessing

Why Is Data Preprocessing Important?

• No quality data, no quality mining results!
• Quality decisions must be based on quality data
  • e.g., duplicate or missing data may cause incorrect or even misleading statistics
• Data preparation, cleaning, and transformation comprise the majority (about 90%) of the work in a data mining application
Understanding Data Attribute Types
• For a customer object, attributes can be customer-id, address, etc.
• The first step before data pre-processing is to differentiate between the different
types of attributes and then pre-process the data accordingly.

Data attribute types:
• Qualitative: Nominal, Binary, Ordinal
• Quantitative: Numeric (Interval, Ratio)


Qualitative Attributes
1. Nominal Attributes
• Values are names of things or some kind of symbols
• Values of a nominal attribute represent some category or state; that is why nominal
attributes are also referred to as categorical attributes
• There is no order (rank, position) among the values of a nominal attribute
Qualitative Attributes

2. Binary Attributes: Binary data has only two values/states,
for example yes or no, true or false.
i. Symmetric: Both values are equally important (e.g., gender); there is no preference on which
should be coded as 0 or 1
ii. Asymmetric: The values are not equally important; the most important outcome is coded
as 1
Qualitative Attributes

3. Ordinal Attributes:
• Values have a meaningful sequence or ranking (order) between them
• But the magnitude between successive values is not known
Quantitative Attributes
Numeric:
• A numeric attribute is quantitative because it is a measurable quantity,
represented as integer or real values.
• Numeric attributes are of two types: interval and ratio.

i) Interval-scaled attributes
• Have order
• Values can be added and subtracted but cannot be multiplied or divided
• E.g., a temperature of 10 degrees Celsius should not be considered twice as hot as 5
degrees Celsius, since 10 degrees Celsius is 50 degrees Fahrenheit and 5 degrees
Celsius is 41 degrees Fahrenheit, which is not twice as much
Quantitative Attributes(Contd..)
• Interval data always appear in the form of numerical values,
where the distance between two points is standardized and equal
• E.g., the difference between 100 degrees Fahrenheit and 90 degrees Fahrenheit
is the same as the difference between 70 degrees Fahrenheit and 60 degrees Fahrenheit
• Interval attributes do not have a true zero

ii) Ratio-scaled attributes


• Have all properties of interval-scaled attributes
• Have a true zero
• Values can be added, subtracted, multiplied, and divided
• E.g., weight, height, etc.
Quantitative Attributes(Contd..)
• Like interval variables, ratio variables can be discrete or continuous.
• A discrete variable is expressed only in countable numbers (e.g., integers), while
a continuous variable can potentially take on an infinite number of values.
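
To make the distinction concrete, the sketch below (not from the slides; the column names and values are illustrative assumptions) shows how nominal, binary, ordinal, and numeric attributes might be represented in pandas, with an ordered categorical preserving the ranking of an ordinal attribute.

```python
# A minimal sketch (illustrative column names and values, not from the slides)
# of how the attribute types above might be represented in pandas.
import pandas as pd

df = pd.DataFrame({
    "customer_id": ["C01", "C02", "C03"],          # nominal (category/identifier)
    "owns_phone": [1, 0, 1],                       # binary (symmetric)
    "fries_size": ["Medium", "Large", "X-Large"],  # ordinal
    "temperature_c": [10.0, 5.0, 21.5],            # numeric, interval-scaled
    "weight_kg": [61.2, 75.0, 68.4],               # numeric, ratio-scaled
})

# Ordinal attributes can be stored as *ordered* categoricals so the ranking
# (Medium < Large < X-Large) is preserved; nominal ones as unordered categoricals.
df["fries_size"] = pd.Categorical(
    df["fries_size"], categories=["Medium", "Large", "X-Large"], ordered=True
)
df["customer_id"] = df["customer_id"].astype("category")

print(df.dtypes)
```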
Exercise

• Classify the following attributes as discrete or continuous. Also classify them as
qualitative (nominal or ordinal) or quantitative (interval or ratio). Some cases
may have more than one interpretation, so briefly indicate your reasoning if you
think there may be some ambiguity.

1. Number of telephones in your house


2. Size of French Fries (Medium or Large or X-Large)
3. Ownership of a cell phone
4. Number of local phone calls you made in a month
5. Length of longest phone call
6. Length of your foot
7. Price of your textbook
8. Zip code
9. Temperature in degrees Fahrenheit
10. Temperature in degrees Celsius
11. Temperature in Kelvin
Major Tasks in Data Preprocessing
• Data cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies

• Data integration
• Integration of multiple databases, data cubes, or files

• Data reduction
• Dimensionality reduction
• Numerosity reduction
• Data compression

• Data transformation and data discretization


• Normalization
• Concept hierarchy generation

Data Cleaning Tasks

• Fill in missing values


• Identify outliers and smooth out
noisy data
• Correct inconsistent data
• Resolve redundancy caused by
data integration

Incomplete (Missing) Data
• Data is not always available
• Many tuples may have no recorded value for several attributes, such as
customer income in sales data
• Missing data may be due to
• Equipment malfunction
• May not be available at the time of entry
• Data not entered due to misunderstanding
• Certain data may not be considered important at the time of entry
• Data inconsistent with other recorded data may have been deleted
• Recording of data or its modifications may have been overlooked
• Missing data may need to be inferred
How to Handle Missing Data?

• Ignore the tuple: not effective unless tuple contains several attributes with
missing values
• Fill in the missing value manually: tedious + infeasible?
• Fill it automatically with
• A global constant, e.g., “Unknown”
• The attribute mean or median
• The attribute mean for all samples belonging to the same class as the tuple:
smarter
• The most probable value: inference-based such as Bayesian formula or
decision tree
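
The automatic fill-in strategies above can be sketched with pandas; the column names and values below are assumptions for illustration only.

```python
# A minimal sketch of the automatic fill-in strategies listed above; the
# column names and values are assumptions for illustration only.
import pandas as pd

df = pd.DataFrame({
    "income": [45000.0, None, 62000.0, None, 38000.0],
    "class":  ["A", "A", "B", "B", "A"],
})

# 1. Ignore the tuple (drop rows with a missing income)
dropped = df.dropna(subset=["income"])

# 2. Fill with a global constant (a sentinel standing in for "Unknown")
const_filled = df["income"].fillna(-1)

# 3. Fill with the attribute mean (the median works the same way)
mean_filled = df["income"].fillna(df["income"].mean())

# 4. Fill with the mean of all samples in the same class as the tuple
class_mean_filled = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean())
)

print(class_mean_filled)
```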
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
• Faulty data collection instruments
• Data entry problems
• Data transmission problems
• Technology limitation
• Inconsistency in naming convention
• Other data problems which require data cleaning
• Duplicate records
• Incomplete data
• Inconsistent data

How to Handle Noisy Data?

• Binning
• First sort data and partition into (equal-frequency) bins
• Then one can smooth by bin means, smooth by bin median, smooth by bin
boundaries, etc.
• Regression
• Smooth by fitting the data into regression functions
• Clustering
• Detect and remove outliers
• Combined computer and human inspection
• Detect suspicious values and check by a human (e.g., deal with possible
outliers)
Binning Methods for Data Smoothing
• Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
• Partition into (equi-depth) bins:
• Bin 1: 4, 8, 9, 15
• Bin 2: 21, 21, 24, 25
• Bin 3: 26, 28, 29, 34
• Smoothing by bin means:
• Bin 1: 9, 9, 9, 9
• Bin 2: 23, 23, 23, 23
• Bin 3: 29, 29, 29, 29
• Smoothing by bin boundaries:
• Bin 1: 4, 4, 4, 15
• Bin 2: 21, 21, 25, 25
• Bin 3: 26, 26, 26, 34
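
The following sketch reproduces the worked example above in Python: the sorted prices are split into equal-frequency bins, then smoothed by bin means and by bin boundaries (each value replaced by the closer boundary).

```python
# A small sketch reproducing the example above: equal-frequency bins,
# smoothing by bin means, and smoothing by bin boundaries.
import numpy as np

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted
bins = np.array_split(prices, 3)                         # 3 equi-depth bins of 4 values

# Smoothing by bin means: every value in a bin becomes the bin's mean
by_means = [[int(round(np.mean(b)))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value becomes the closer of the two boundaries
def smooth_by_boundaries(b):
    lo, hi = int(b[0]), int(b[-1])
    return [lo if (v - lo) <= (hi - v) else hi for v in b]

by_boundaries = [smooth_by_boundaries(b) for b in bins]

print(by_means)       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_boundaries)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```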
Regression for Data Smoothing
• The regression functions are used to determine the
relationship between the dependent variable (target
field) and one or more independent variables. The
dependent variable is the one whose values you want to
predict, whereas the independent variables are the
variables that you base your prediction on.
• A RegressionModel defines three types of regression
models: linear, polynomial, and logistic regression.
• The modelType attribute indicates the type of
regression used.
• Linear and stepwise-polynomial regression are designed
for numeric dependent variables having a continuous
spectrum of values. These models should contain
exactly one regression table.
• Logistic regression is designed for categorical dependent
variables.
• For linear and stepwise regression, the regression formula is:
Dependent variable = intercept + Σi (coefficienti × independent variablei) + error
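
As a minimal illustration of smoothing by regression (the x and y values below are made up), a straight line is fitted by least squares and the noisy y values are replaced by their fitted values:

```python
# A minimal sketch (made-up x and y values) of smoothing by regression:
# fit a straight line y = intercept + coefficient * x by least squares and
# replace the noisy y values with their fitted values.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.3, 11.9])   # noisy observations

coefficient, intercept = np.polyfit(x, y, deg=1)  # slope and intercept of the fit
smoothed_y = intercept + coefficient * x          # smoothed (fitted) values

print(np.round(smoothed_y, 2))
```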
Clustering for Data Smoothing: Outlier Removal

• Data points inconsistent with the majority of data


• Different kinds of outliers
• Valid: e.g., a CEO’s salary
• Noisy: e.g., a person’s age = 200; widely deviating points
• Removal methods
• Clustering
• Curve-fitting
• Hypothesis-testing with a given model
Exercise

• Suppose a group of 12 sales price records has been sorted as follows:


5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215. Partition them into three
bins and smooth them by each of the following methods:
(a) equi-depth partitioning
(b) smoothing by bin boundaries

• Solution:
(a) Equi-depth partitioning: Bin 1: 5, 10, 11, 13; Bin 2: 15, 35, 50, 55; Bin 3:
72, 92, 204, 215
(b) Smoothing by bin boundaries: Bin 1: 5, 13, 13, 13; Bin 2: 15, 15, 55, 55; Bin
3: 72, 72, 215, 215
Data Warehouse and Data Mining

• Introduction to KDD (Knowledge Discovery in database)


• Data Preprocessing: An Overview
• Data Quality
• Major Tasks in Data Preprocessing
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation and Data Discretization
• Summary

Data Integration

The process of combining data from multiple sources into a single,


unified view.

Data integration techniques:


• Schema matching
• Instance conflict resolution
• Source selection
• Result merging
• Quality composition

Data Integration(Contd..)
• Data integration:
• Combines data from multiple sources into a coherent store
• Schema integration: e.g., A.cust-id ≡ B.cust-#
• Integrate metadata from different sources
• Entity identification problem:
• Identify real world entities from multiple data sources, e.g., Bill Clinton =
William Clinton
• Redundancy:
• Inconsistencies in attribute or dimension naming can cause redundancy
• Detecting and resolving data value conflicts
• For the same real-world entity, attribute values from different sources may
differ. Possible reasons: different representations, different scales, e.g.,
metric vs. British units
Handling Redundancy in Data Integration

Redundancy & Correlation Analysis:


• Redundant data often occur when multiple databases are integrated
• Object identification: The same attribute or object may have different names in different
databases
• Derivable data: One attribute may be a “derived” attribute in another table, e.g., annual
revenue

• Redundant attributes can often be detected by correlation analysis and
covariance analysis
• Careful integration of data from multiple sources may help reduce/avoid
redundancies and inconsistencies and improve mining speed and quality

Detection of Data Redundancy-Correlation

• Redundancies can be detected using following methods:


• χ² test (used for nominal, categorical, or qualitative data)
• Correlation coefficient and covariance (used for numeric or quantitative data)
• The term "correlation" refers to a mutual relationship or association
between quantities.
• In almost any business, it is useful to express one quantity in terms of its
relationship with others.
• For example, sales might increase when the marketing department spends more on
TV advertisements,
• Customer's average purchase amount on an e-commerce website might depend on
a number of factors related to that customer.
• Often, correlation is the first step to understanding these relationships and
subsequently building better business and statistical models.
Why is correlation a useful metric?

• Correlation can help in predicting one quantity from another


• Correlation can (but often does not, as we will see in some examples
below) indicate the presence of a causal relationship
• Correlation is used as a basic quantity and foundation for many other
modeling techniques
• More formally, correlation is a statistical measure that describes the
association between random variables.
• There are several methods for calculating the correlation coefficient,
each measuring different types of strength of association

Correlation Analysis (Numeric data )
• Evaluate the correlation between two attributes, A and B, by computing the correlation
coefficient (Pearson’s product-moment coefficient):

r(A,B) = Σi (ai − Ā)(bi − B̄) / (n·σA·σB) = (Σi ai·bi − n·Ā·B̄) / (n·σA·σB)

where n is the number of tuples, ai and bi are the respective values of A and
B in tuple i,
Ā and B̄ are the respective mean values of A and B,
σA and σB are the respective standard deviations of A and B, and
Σi ai·bi is the sum of the AB cross-product (i.e., for each tuple, the value for A
is multiplied by the value for B in that tuple)
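
A minimal sketch of this formula in Python, applied to the stock values used in the covariance example later in this unit ((2, 5), (3, 8), (5, 10), (4, 11), (6, 14)); numpy's corrcoef is shown only as a cross-check:

```python
# A minimal sketch of the formula above, applied to the stock values used in
# the covariance example later in this unit; np.corrcoef is only a cross-check.
import numpy as np

A = np.array([2, 3, 5, 4, 6], dtype=float)
B = np.array([5, 8, 10, 11, 14], dtype=float)

n = len(A)
# r(A,B) = sum((a_i - mean(A)) * (b_i - mean(B))) / (n * sigma_A * sigma_B),
# using population standard deviations (ddof = 0).
r = np.sum((A - A.mean()) * (B - B.mean())) / (n * A.std() * B.std())

print(round(r, 3))                         # 0.941 -> strong positive correlation
print(round(np.corrcoef(A, B)[0, 1], 3))   # same value from numpy
```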

Correlation- Pearson Correlation Coefficient

• Pearson is the most widely used correlation coefficient.


• Pearson correlation measures the linear association between
continuous variables.
• The Pearson correlation coefficient is a statistic that can range between −1
and +1

Correlation Analysis (Numeric data )

• Note that
• If the resulting value is greater than 0, then A and B are positively correlated, meaning
that the values of A increase as the values of B increase
• The higher the value, the stronger the correlation
• A high value may indicate that A (or B) can be removed as a redundancy
• If the resulting value is equal to 0, then there is no linear correlation between A and B
(though they are not necessarily independent)
• If the resulting value is less than 0, then A and B are negatively correlated, where
the values of one attribute increase as the values of the other attribute decrease

Covariance Analysis(Numeric Data)
• Correlation and covariance are two similar measures for assessing how much two
attributes change together
• Consider two numeric attributes A and B, and a set of n observations
{(a1, b1), ..., (an, bn)}

• The mean values of A and B, respectively, are also known as the expected values of A
and B, that is, E(A) = Ā = (Σi ai)/n and E(B) = B̄ = (Σi bi)/n

• The covariance between A and B is defined as
Cov(A, B) = E[(A − Ā)(B − B̄)] = Σi (ai − Ā)(bi − B̄) / n

Covariance Analysis(Numeric Data)

• Also, for simplified calculations,
Cov(A, B) = E(A·B) − Ā·B̄

• For two attributes A and B that tend to change together,
• if A is larger than Ā (its expected value), then B is likely to be larger than B̄. Hence,
the covariance between A and B is positive
• if one of the attributes tends to be above its expected value when the other attribute is
below its expected value, then the covariance of A and B is negative
• if A and B are independent, the covariance is 0 (but a covariance of 0 does not by itself
imply independence; see the next slide)

Correlation

• Positive covariance: If Cov(A, B) > 0, then A and B both tend to be
larger than their expected values
• Negative covariance: If Cov(A, B) < 0, then if A is larger than its
expected value, B is likely to be smaller than its expected value
• Independence: If A and B are independent, Cov(A, B) = 0, but the converse is not true:
• Some pairs of random variables may have a covariance of 0 but are not
independent. Only under some additional assumptions (e.g., the data follow
multivariate normal distributions) does a covariance of 0 imply independence

Covariance: Example

• Stock prices observed at five time points for AllElectronics (A) and HighTech (B),
a high-tech company. Suppose the two stocks have the following values in
one week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
• Question: If the stocks are affected by the same industry
trends, will their prices rise or fall together?

• E(A) = (2 + 3 + 5 + 4 + 6)/5 = 20/5 = 4

• E(B) = (5 + 8 + 10 + 11 + 14)/5 = 48/5 = 9.6

• Cov(A, B) = (2×5 + 3×8 + 5×10 + 4×11 + 6×14)/5 − 4 × 9.6 = 42.4 − 38.4 = 4

• Thus, A and B rise together since Cov(A, B) > 0.
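
The same computation can be checked in a few lines of Python using the simplified form Cov(A, B) = E(A·B) − Ā·B̄:

```python
# A quick check of the computation above using the simplified form
# Cov(A, B) = E(A*B) - mean(A) * mean(B).
import numpy as np

A = np.array([2, 3, 5, 4, 6], dtype=float)
B = np.array([5, 8, 10, 11, 14], dtype=float)

cov = np.mean(A * B) - A.mean() * B.mean()
print(round(cov, 2))                          # 4.0
print(round(np.cov(A, B, ddof=0)[0, 1], 2))   # same result (population covariance)
```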

Data Value Conflict Detection and Resolution
• For the same real-world entity, attribute values from different sources
may differ
• E.g., prices of rooms in different cities may involve different currencies
• Attributes may also differ on the abstraction level, where an attribute
in one system is recorded at, say, a lower abstraction level than the
“same” attribute in another.
• E.g., total sales in one database may refer to one branch of All_Electronics,
while an attribute of the same name in another database may refer to the
total sales for All_Electronics stores in a given region.
• To resolve this, data values have to be converted into a consistent form

Data Transformation

• Smoothing: remove noise from data (binning, clustering, regression)


• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: scaled to fall within a small, specified range
• Min-max normalization
• Z-score normalization
• Normalization by decimal scaling
• Attribute/feature construction
• New attributes constructed from the given ones
• E.g., add the attribute area based on the attributes height and width

Data Transformation: Normalization

• Min-max normalization:
v′ = (v − minA) / (maxA − minA) × (new_maxA − new_minA) + new_minA

• Z-score normalization:
v′ = (v − Ā) / σA

• Normalization by decimal scaling:
v′ = v / 10^j, where j is the smallest integer such that max(|v′|) < 1

Data Transformation: Min Max-Normalization
• Min-max normalization: maps values to [new_minA, new_maxA]
• Performs a linear transformation on the original data.
• Suppose that minA and maxA are the minimum and maximum values of an attribute,
A.
• Min-max normalization maps a value, vi, of A to the range [new_minA, new_maxA] by
computing

vi′ = (vi − minA) / (maxA − minA) × (new_maxA − new_minA) + new_minA

• E.g., suppose that the minimum and maximum values for the attribute income are $12,000 and
$98,000, respectively, and we would like to map income to the range [0.0, 1.0]. By min-max
normalization, a value of $73,600 for income is transformed to
(73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716
Data Transformation: Z-score Normalization

• Z-score normalization (uses the mean and σ, the standard deviation):

vi′ = (vi − Ā) / σA

• Values for an attribute, A, are normalized based on the mean (i.e., average) and
standard deviation of A

• E.g., suppose that the mean and standard deviation of the values for the attribute
income are $54,000 and $16,000, respectively. With z-score normalization, a value of
$73,600 for income is transformed to (73,600 − 54,000) / 16,000 = 1.225

Data Transformation: Decimal scaling Normalization

• Suppose that the recorded values of A range from −986 to 917

• The maximum absolute value of A is 986
• To normalize by decimal scaling, we therefore divide each value by
1,000 (i.e., j = 3),
• so that −986 normalizes to −0.986 and
• 917 normalizes to 0.917
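
The three normalization methods can be sketched together in Python, reusing the income and decimal-scaling examples from the slides:

```python
# A minimal sketch of the three normalization methods, reusing the income
# example (min $12,000, max $98,000, mean $54,000, std $16,000) and the
# decimal-scaling example (-986 to 917) from the slides.
import numpy as np

v = 73600.0

# Min-max normalization to [0.0, 1.0]
min_a, max_a = 12000.0, 98000.0
new_min, new_max = 0.0, 1.0
v_minmax = (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min   # 0.716

# Z-score normalization
mean_a, std_a = 54000.0, 16000.0
v_zscore = (v - mean_a) / std_a                                            # 1.225

# Decimal scaling: divide by 10^j; for values in [-986, 917], j = 3
values = np.array([-986.0, 917.0])
v_decimal = values / 10**3                                                 # [-0.986, 0.917]

print(round(v_minmax, 3), round(v_zscore, 3), v_decimal)
```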

Exercise

• Define normalization.
• What is the value range of min-max normalization? Use min-max normalization to
normalize the following group of data: 8, 10, 15, 20.

• Solution (mapping to [0.0, 1.0]):
• 8 → 0
• 10 → (10 − 8)/(20 − 8) ≈ 0.167
• 15 → (15 − 8)/(20 − 8) ≈ 0.583
• 20 → 1
Data Mining

• Introduction to KDD
• Data Preprocessing: An Overview
• Data Quality
• Major Tasks in Data Preprocessing
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation and Data Discretization
• Summary

Data Reduction Strategies

• Data is too big to work with


• Data reduction
• Obtain a reduced representation of the data set that is much smaller in
volume but yet produce the same (or almost the same) analytical results
• Why data reduction? — A database/data warehouse may store
terabytes of data. Complex data analysis may take a very long time to
run on the complete data set.
• Data reduction strategies
• Dimensionality reduction — remove unimportant attributes
• Aggregation and clustering
• Sampling

Data Reduction Strategies

• Obtains a reduced representation of the data set that is much smaller


in volume but yet produces the same (or almost the same) analytical
results
• Data reduction strategies
• Data cube aggregation
• Dimensionality reduction
• Data compression
• Numerosity reduction
• Discretization and concept hierarchy generation

Data Cube Aggregation

• Multiple levels of aggregation in data cubes


• Further reduce the size of data to deal with

• Reference appropriate levels


• Use the smallest representation capable of solving the
task
• Queries regarding aggregated information
should be answered using data cube, when
possible

Data Reduction Strategies

• Data reduction strategies:


• Dimensionality reduction, e.g., remove unimportant attributes
• Wavelet transforms
• Principal Components Analysis (PCA)
• Feature subset selection, feature creation
• Numerosity reduction (some simply call it: Data Reduction)
• Regression and Log-Linear Models
• Histograms, clustering, sampling
• Data cube aggregation
• Data compression

Dimensionality reduction :Attribute Subset Selection
• Reduces the data set size by removing irrelevant or redundant attributes (or
dimensions)
• Goal of attribute subset selection is to find a minimum set of attributes
• Improves speed of mining as dataset size is reduced
• Mining on a reduced data set also makes the discovered pattern easier to
understand
• Redundant attributes
• Duplicate information contained in one or more attributes
• E.g., purchase price of a product and the amount of sales tax paid
• Irrelevant attributes
• Contain no information that is useful for the data mining task at hand
• E.g., students' telephone number is often irrelevant to the task of predicting students' CGPA

Attribute Subset Selection

Best and worst attributes are determined using tests of statistical


significance
• A test of significance is a formal procedure for comparing observed data with a
claim (also called a hypothesis), the truth of which is being assessed
• Claim is a statement about a parameter
• Results of a significance test are expressed in terms of a probability that
measures how well the data and the claim agree

Heuristic(Greedy) methods for attribute subset selection

1. Stepwise Forward Selection:


• Starts with an empty set of attributes as the reduced set
• Best of the relevant attributes is determined and added to the reduced set
• In each iteration, best of remaining attributes is added to the set
2. Stepwise Backward Elimination:
• Here all the attributes are considered in the initial set of attributes
• In each iteration, worst attribute remaining in the set is removed
3. Combination of Forward Selection and Backward Elimination:
• Stepwise forward selection and backward elimination are combined
• At each step, the procedure selects the best attribute and removes the worst from among the
remaining attributes
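
As a rough illustration of method 1 (not the textbook's exact procedure), stepwise forward selection can be sketched as a greedy loop that keeps adding the attribute whose inclusion most improves a user-supplied score or significance function; the attribute names and weights below are hypothetical:

```python
# A rough sketch (not the textbook's exact procedure) of stepwise forward
# selection: greedily add the attribute whose inclusion most improves a
# user-supplied score/significance function, stopping when nothing improves.
def forward_selection(attributes, score, max_attrs=None):
    selected = []
    remaining = list(attributes)
    best_score = float("-inf")
    while remaining and (max_attrs is None or len(selected) < max_attrs):
        # best remaining attribute according to the score of the enlarged set
        candidate = max(remaining, key=lambda a: score(selected + [a]))
        candidate_score = score(selected + [candidate])
        if candidate_score <= best_score:   # no improvement: stop
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected

# Hypothetical attributes with toy "usefulness" weights standing in for a real
# significance test; the score of a set is simply the sum of its weights.
weights = {"income": 0.9, "age": 0.5, "zip_code": 0.05, "phone": 0.0}
print(forward_selection(weights, score=lambda attrs: sum(weights[a] for a in attrs)))
# -> ['income', 'age', 'zip_code']  ("phone" adds nothing and is never selected)
```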

Heuristic(Greedy) methods for attribute subset selection(cont)

4. Decision Tree Induction:


• This approach uses a decision tree for attribute selection.
• It constructs a flow-chart-like structure whose internal nodes denote tests on attributes.
• Each branch corresponds to an outcome of a test, and each leaf node denotes a class prediction.
• Attributes that do not appear in the tree are considered irrelevant and hence discarded.

Data Reduction 2: Numerosity Reduction

• Parametric methods
• Assume the data fits some model, estimate model parameters, store only the
parameters, and discard the data (except possible outliers)
• E.g., log-linear models: obtain the value at a point in m-D space as a product over
appropriate marginal subspaces
• Non-parametric methods
• Do not assume models
• Major families: histograms, clustering, sampling

Regression and Log-Linear Models

• Linear regression: Data are modeled to fit a straight line: Y = α + β·X
• Often uses the least-squares method to fit the line
• Multiple regression: allows a response variable Y to be modeled as a
linear function of a multidimensional feature vector (predictor
variables): Y = b0 + b1·X1 + b2·X2
• Log-linear model: approximates discrete multidimensional joint
probability distributions, e.g., p(a, b, c, d) = α_ab · β_ac · χ_ad · δ_bcd
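
A minimal sketch (with made-up data) of multiple regression used as a parametric numerosity-reduction model: only the fitted coefficients b0, b1, b2 are stored, and the y values can be approximately reconstructed from them.

```python
# A minimal sketch (made-up data) of multiple regression as a parametric
# numerosity-reduction model: only the fitted parameters b0, b1, b2 are kept,
# and the y values can be approximately reconstructed from them.
import numpy as np

X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])                     # predictor variables X1, X2
y = np.array([5.1, 5.9, 11.2, 11.8, 16.0])     # response variable

design = np.column_stack([np.ones(len(X)), X])             # columns [1, X1, X2]
(b0, b1, b2), *_ = np.linalg.lstsq(design, y, rcond=None)  # least-squares fit

print(round(b0, 2), round(b1, 2), round(b2, 2))            # stored parameters
print(np.round(design @ np.array([b0, b1, b2]), 2))        # reconstructed y values
```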

Histograms

• Histograms (or frequency histograms) are at least a century old and are widely used.
• Plotting histograms is a graphical method for summarizing the distribution of a given attribute, X.
• Height of the bar indicates the frequency (i.e., count) of that X value
• Range of values for X is partitioned into disjoint consecutive subranges.
• Subranges, referred to as buckets or bins, are disjoint subsets of the data distribution for X.
• Range of a bucket is known as the width
• Typically, the buckets are of equal width.
• E.g., a price attribute with a value range of $1 to $200 can be partitioned into subranges 1 to 20, 21
to 40, 41 to 60, and so on.
• For each subrange, a bar is drawn with a height that represents the total count of items observed
within the subrange
Histograms

• Approximate data distributions


• Divide data into buckets and store
average (sum) for each bucket
• A bucket represents an attribute-
value/frequency pair
• Can be constructed optimally in
one dimension using dynamic
programming
• Related to quantization problems.

Histogram Analysis- Explanation & Example

• The following data are a list of


prices of commonly sold items at
AllElectronics (rounded to the
nearest dollar).
• The numbers have been sorted:
• 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10,
10, 12, 14, 14, 14, 15, 15, 15, 15, 15,
15, 18, 18, 18, 18, 18, 18, 18, 18, 20,
20, 20, 20, 20, 20, 20, 21, 21, 21, 21,
25, 25, 25, 25, 25, 28, 28, 30, 30, 30.
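
A minimal sketch builds an equal-width histogram for this price list with numpy; the choice of three buckets of width 10 ($1-10, $11-20, $21-30) is illustrative.

```python
# A minimal sketch of an equal-width histogram for the price list above;
# three buckets of width 10 ($1-10, $11-20, $21-30) are used for illustration.
import numpy as np

prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10,
          10, 12, 14, 14, 14, 15, 15, 15, 15, 15,
          15, 18, 18, 18, 18, 18, 18, 18, 18, 20,
          20, 20, 20, 20, 20, 20, 21, 21, 21, 21,
          25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

counts, edges = np.histogram(prices, bins=[1, 11, 21, 31])

for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"${int(lo)}-${int(hi) - 1}: {c} items")   # bar height = item count per bucket
```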

Histogram of an image : Application

As the histogram of the image shows, most of the high-frequency bars lie in the first half of the
intensity range, which is the darker portion. This means that the image is dark overall.
Data Compression

• String compression
• There are extensive theories and well-tuned algorithms
• Typically lossless
• But only limited manipulation is possible without expansion
• Audio/video, image compression
• Typically lossy compression, with progressive refinement
• Sometimes small fragments of signal can be reconstructed without
reconstructing the whole
• Time sequences (unlike audio) are
typically short and vary slowly with time

Data Compression

• Lossless compression: the original data can be reconstructed exactly from the compressed data
• Lossy compression: only an approximation of the original data can be reconstructed
Clustering

• Partition data set into clusters, and store cluster representation only
• Quality of clusters measured by their diameter (max distance
between any two objects in the cluster) or centroid distance (avg.
distance of each cluster object from its centroid)
• Can be very effective if data is clustered but not if data is “smeared”
• Can have hierarchical clustering (possibly stored in multi-dimensional
index tree structures (B+-tree, R-tree, quad-tree, etc))
• There are many choices of clustering definitions and clustering
algorithms (further details later)

Sampling
• Allow a mining algorithm to run in complexity that is potentially sub-linear to the
size of the data
• Cost of sampling: proportional to the size of the sample, increases linearly with
the number of dimensions
• Choose a representative subset of the data
• Simple random sampling may have very poor performance in the presence of skew
• Develop adaptive sampling methods
• Stratified sampling:
• Approximate the percentage of each class (or subpopulation of interest) in the overall
database
• Used in conjunction with skewed data
• Sampling may not reduce database I/Os (page at a time).
• Sampling: natural choice for progressive refinement of a reduced data set.
Types of Sampling
• Simple random sampling
• There is an equal probability of selecting any particular item
• Sampling without replacement
• Once an object is selected, it is removed from the population
• Sampling with replacement
• A selected object is not removed from the population
• Stratified sampling:
• Partition the data set, and draw samples from each partition (proportionally, i.e.,
approximately the same percentage of the data)
• Used in conjunction with skewed data
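
The sampling variants above can be sketched with pandas; the DataFrame and the skewed class distribution below are assumptions for illustration.

```python
# A minimal sketch of the sampling variants above; the DataFrame and the
# skewed class distribution are assumptions for illustration.
import pandas as pd

df = pd.DataFrame({
    "value": range(100),
    "class": ["A"] * 80 + ["B"] * 20,    # skewed: 80% A, 20% B
})

srswor = df.sample(n=10, replace=False, random_state=0)   # SRS without replacement
srswr  = df.sample(n=10, replace=True,  random_state=0)   # SRS with replacement

# Stratified sample: draw ~10% from each class so the skew is preserved
stratified = df.groupby("class").sample(frac=0.1, random_state=0)
print(stratified["class"].value_counts())                 # A: 8, B: 2
```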

Types of Sampling

• SRSWOR: a simple random sample drawn from the raw data without replacement
• SRSWR: a simple random sample drawn from the raw data with replacement
Sampling

• Cluster/stratified sample: a sample drawn from the raw data, partition by partition
Discretization
• Three types of attributes
• Nominal—values from an unordered set, e.g., color, profession
• Ordinal—values from an ordered set, e.g., military or academic rank
• Numeric — e.g., integer or real numbers
• Discretization: Divide the range of a continuous attribute into intervals
• Interval labels can then be used to replace actual data values
• Reduce data size by discretization
• Supervised vs. unsupervised
• Split (top-down) vs. merge (bottom-up)
• Discretization can be performed recursively on an attribute
• Prepare for further analysis, e.g., classification

Discretization

• Three types of attributes:


• Nominal — values from an unordered set
• Ordinal — values from an ordered set
• Continuous — real numbers
• Discretization/Quantization:
● divide the range of a continuous attribute into intervals

• Some classification algorithms only accept categorical attributes.


• Reduce data size by discretization
• Prepare for further analysis

Discretization and Concept Hierarchies

• Discretization
• reduce the number of values for a given continuous attribute by
dividing the range of the attribute into intervals. Interval labels can
then be used to replace actual data values.
• Concept Hierarchies
• reduce the data by collecting and replacing low level concepts
(such as numeric values for the attribute age) by higher level
concepts (such as young, middle-aged, or senior).
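
A minimal sketch of both ideas for the attribute age using pandas; the ages, interval boundaries, and concept labels below are assumptions for illustration.

```python
# A minimal sketch (assumed ages, interval boundaries, and concept labels) of
# discretization and a simple concept hierarchy for the attribute age.
import pandas as pd

ages = pd.Series([13, 22, 25, 34, 41, 58, 63, 70], name="age")

# Discretization: divide the continuous range into intervals
intervals = pd.cut(ages, bins=[0, 20, 40, 60, 100])

# Concept hierarchy: replace low-level numeric values by higher-level concepts
concepts = pd.cut(ages, bins=[0, 20, 40, 60, 100],
                  labels=["youth", "young adult", "middle-aged", "senior"])

print(pd.DataFrame({"age": ages, "interval": intervals, "concept": concepts}))
```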

Discretization and Concept Hierarchies : Numerical data

• Hierarchical and recursive decomposition using:


• Binning (data smoothing)

• Histogram analysis (numerosity reduction)

• Clustering analysis (numerosity reduction)

• Entropy-based discretization

• Segmentation by natural partitioning

Data Discretization Methods
• Typical methods: All the methods can be applied recursively
• Binning
• Top-down split, unsupervised
• Histogram analysis
• Top-down split, unsupervised
• Clustering analysis (unsupervised, top-down split or bottom-up merge)
• Decision-tree analysis (supervised, top-down split)
• Correlation (e.g., χ2) analysis (unsupervised, bottom-up merge)

Summary
• Data quality: accuracy, completeness, consistency, timeliness, believability, interpretability
• Data cleaning: e.g. missing/noisy values, outliers
• Data integration from multiple sources:
• Entity identification problem
• Remove redundancies
• Detect inconsistencies
• Data reduction
• Dimensionality reduction
• Numerosity reduction
• Data compression
• Data transformation and data discretization
• Normalization
• Concept hierarchy generation

Reference

• Jiawei Han, Micheline Kamber, and Jian Pei, Data Mining: Concepts and Techniques, 3rd edition.

