Data preprocessing (1)
Understanding Data Attribute Types
• For a customer object, attributes can be customer-id, address, etc.
• The first step before data preprocessing is to differentiate between the different types of attributes, and then preprocess the data accordingly.
• Data attribute types fall into two broad categories: qualitative and quantitative
3. Ordinal Attributes:
• Values have a meaningful sequence or ranking (order) between them
• But the magnitude between the values is not actually known
Quantitative Attributes
Numeric:
• A numeric attribute is quantitative because it is a measurable quantity, represented as integer or real values.
• Numeric attributes are of two types: interval-scaled and ratio-scaled.
i) Interval-scaled attributes
• Have order
• Values can be added and subtracted but cannot be meaningfully multiplied or divided
• E.g., a temperature of 10 degrees Celsius should not be considered twice as hot as 5 degrees Celsius: 10 degrees Celsius is 50 degrees Fahrenheit and 5 degrees Celsius is 41 degrees Fahrenheit, which is not twice as much
Quantitative Attributes (contd.)
• Interval data always appears as numbers or numerical values where the distance between two points is standardized and equal
• E.g., the difference between 100 and 90 degrees Fahrenheit is the same as the difference between 70 and 60 degrees Fahrenheit
• Interval-scaled attributes do not have a true zero
Major Tasks in Data Preprocessing
• Data integration
• Integration of multiple databases, data cubes, or files
• Data reduction
• Dimensionality reduction
• Numerosity reduction
• Data compression
Data Cleaning Tasks
Incomplete (Missing) Data
• Data is not always available
• Many tuples may have no recorded value for several attributes, such as customer income in sales data
• Missing data may be due to
• Equipment malfunction
• May not be available at the time of entry
• Data not entered due to misunderstanding
• Certain data may not be considered important at the time of entry
• Data inconsistent with other recorded data may have been deleted
• Recording of data or its modifications may have been overlooked
• Missing data may need to be inferred
How to Handle Missing Data?
• Ignore the tuple: not effective unless the tuple contains several attributes with missing values
• Fill in the missing value manually: tedious and often infeasible
• Fill it in automatically with
• A global constant, e.g., “Unknown”
• The attribute mean or median
• The attribute mean for all samples belonging to the same class as the tuple:
smarter
• The most probable value: inference-based, e.g., using a Bayesian formula or a decision tree
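As a concrete sketch of the automatic fill-in strategies above, the following pandas snippet is illustrative; the DataFrame, its column names, and the sentinel value are assumptions, not from the slides:

```python
import pandas as pd

# Toy data: income is missing for some tuples (column names are hypothetical)
df = pd.DataFrame({
    "cls": ["A", "A", "B", "B"],
    "income": [30000.0, None, 52000.0, None],
})

# Global constant (a sentinel such as -1, or "Unknown" for categorical data)
df["income_const"] = df["income"].fillna(-1)

# Attribute mean (median works the same way with .median())
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Class-conditional mean: mean of income within the tuple's own class
df["income_cls_mean"] = df["income"].fillna(
    df.groupby("cls")["income"].transform("mean")
)
```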
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
• Faulty data collection instruments
• Data entry problems
• Data transmission problems
• Technology limitation
• Inconsistency in naming convention
• Other data problems which require data cleaning
• Duplicate records
• Incomplete data
• Inconsistent data
How to Handle Noisy Data?
• Binning
• First sort data and partition into (equal-frequency) bins
• Then one can smooth by bin means, smooth by bin median, smooth by bin
boundaries, etc.
• Regression
• Smooth by fitting the data into regression functions
• Clustering
• Detect and remove outliers
• Combined computer and human inspection
• Detect suspicious values and have a human check them (e.g., deal with possible outliers)
Binning Methods for Data Smoothing
• Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
• Partition into (equi-depth) bins:
• Bin 1: 4, 8, 9, 15
• Bin 2: 21, 21, 24, 25
• Bin 3: 26, 28, 29, 34
• Smoothing by bin means:
• Bin 1: 9, 9, 9, 9
• Bin 2: 23, 23, 23, 23
• Bin 3: 29, 29, 29, 29
• Smoothing by bin boundaries:
• Bin 1: 4, 4, 4, 15
• Bin 2: 21, 21, 25, 25
• Bin 3: 26, 26, 26, 34
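A short Python sketch that reproduces the worked example above; NumPy's array_split stands in for equi-depth partitioning since the data divides evenly here:

```python
import numpy as np

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted
bins = np.array_split(prices, 3)  # equal-frequency (equi-depth) partitioning

# Smoothing by bin means: replace each value with its bin's (rounded) mean
by_means = [[round(float(b.mean()))] * len(b) for b in bins]

# Smoothing by bin boundaries: snap each value to the closer of bin min/max
by_bounds = [[int(b.min()) if v - b.min() <= b.max() - v else int(b.max())
              for v in b] for b in bins]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```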
Regression for Data Smoothing
• The regression functions are used to determine the
relationship between the dependent variable (target
field) and one or more independent variables. The
dependent variable is the one whose values you want to
predict, whereas the independent variables are the
variables that you base your prediction on.
• A RegressionModel defines three types of regression
models: linear, polynomial, and logistic regression.
• The modelType attribute indicates the type of
regression used.
• Linear and stepwise-polynomial regression are designed for numeric dependent variables having a continuous spectrum of values. These models should contain exactly one regression table.
• For linear and stepwise regression, the regression formula is:
dependent variable = intercept + Σᵢ (coefficientᵢ × independent variableᵢ) + error
• Logistic regression is designed for categorical dependent variables.
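As a minimal sketch of regression-based smoothing, the snippet below fits the linear formula above by least squares and replaces each observation with its fitted value; the data points are made up for illustration:

```python
import numpy as np

# Hypothetical noisy readings: y depends roughly linearly on x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

# Fit y = intercept + coefficient * x (least squares)
coefficient, intercept = np.polyfit(x, y, deg=1)

# Smooth: replace each observed y with the value predicted by the line
y_smooth = intercept + coefficient * x
```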
Clustering for Data Smoothing: Outlier Removal
• Similar values are organized into groups (clusters); values that fall outside the set of clusters may be considered outliers
• Exercise solution:
(a) Equi-depth partitioning: Bin 1: 5, 10, 11, 13; Bin 2: 15, 35, 50, 55; Bin 3: 72, 92, 204, 215
(b) Smoothing by bin boundaries: Bin 1: 5, 13, 13, 13; Bin 2: 15, 15, 55, 55; Bin 3: 72, 72, 215, 215
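A hedged sketch of clustering-based outlier detection, reusing the exercise values above; the two-cluster choice and the 20% small-cluster rule are illustrative assumptions, not from the slides:

```python
import numpy as np
from sklearn.cluster import KMeans

values = np.array([5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215],
                  dtype=float).reshape(-1, 1)

# Group the values into clusters; very small clusters far from the rest
# are treated as outlier candidates
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(values)
counts = np.bincount(km.labels_)
is_small = counts < 0.2 * len(values)           # clusters with <20% of points
outliers = values.ravel()[is_small[km.labels_]]
print(outliers)  # expected: [204. 215.]
```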
Data Integration
• Data integration:
• Combines data from multiple sources into a coherent store
• Schema integration: e.g., A.cust-id ≡ B.cust-#
• Integrate metadata from different sources
• Entity identification problem:
• Identify real world entities from multiple data sources, e.g., Bill Clinton =
William Clinton
• Redundancy:
• Inconsistencies in attribute or dimension naming can cause redundancy
• Detecting and resolving data value conflicts
• For the same real-world entity, attribute values from different sources may differ. Possible reasons: different representations, different scales, e.g., metric vs. British units
Handling Redundancy in Data Integration
Detection of Data Redundancy: Correlation
• For numeric attributes A and B, the correlation (Pearson) coefficient is
r(A,B) = Σᵢ (aᵢ − Ā)(bᵢ − B̄) / (n σA σB) = (Σᵢ aᵢbᵢ − n Ā B̄) / (n σA σB)
where n is the number of tuples, aᵢ and bᵢ are the respective values of A and B in tuple i, Ā and B̄ are the respective mean values of A and B, σA and σB are the respective standard deviations of A and B, and Σᵢ aᵢbᵢ is the sum of the AB cross-product (i.e., for each tuple, the value for A is multiplied by the value for B in that tuple)
• Note that
• If the resulting value is greater than 0, then A and B are positively correlated, meaning that the values of A increase as the values of B increase
• The higher the value, the stronger the correlation
• A high value may indicate that A (or B) can be removed as a redundancy
• If the resulting value is equal to 0, then A and B are independent and there is no correlation between them
• If the resulting value is less than 0, then A and B are negatively correlated, where the values of one attribute increase as the values of the other attribute decrease
• The mean values of A and B are also known as the expected values of A and B, that is, Ā = E(A) = (1/n) Σᵢ aᵢ and B̄ = E(B) = (1/n) Σᵢ bᵢ
• The table shows the stock prices of two companies at five time points. If the stocks are affected by the same industry trends, determine whether their prices rise or fall together.
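The coefficient is easy to compute directly; the snippet below uses illustrative prices, since the slide's table is not reproduced here (note that np.std defaults to the population standard deviation, matching the n in the denominator):

```python
import numpy as np

# Illustrative closing prices of two stocks at five time points
A = np.array([6.0, 5.0, 4.0, 3.0, 2.0])
B = np.array([20.0, 10.0, 14.0, 5.0, 5.0])

n = len(A)
r = ((A * B).sum() - n * A.mean() * B.mean()) / (n * A.std() * B.std())
print(round(r, 3))  # ≈ 0.867: r > 0, so the prices tend to rise and fall together
# Same result as np.corrcoef(A, B)[0, 1]
```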
Data Transformation: Normalization
• Min-max normalization to a new range [new_min, new_max]:
v′ = ((v − min) / (max − min)) × (new_max − new_min) + new_min
• Z-score normalization:
v′ = (v − mean) / standard deviation
• E.g., suppose that the minimum and maximum values for the attribute income are $12,000 and $98,000, respectively, and we want to map income to the range [0.0, 1.0]. By min-max normalization, a value of $73,600 for income is transformed to (73,600 − 12,000) / (98,000 − 12,000) = 0.716
Data Transformation: Z-score Normalization
• E.g., suppose that the mean and standard deviation of the values for the attribute income are $54,000 and $16,000, respectively. By z-score normalization, a value of $73,600 for income is transformed to (73,600 − 54,000) / 16,000 = 1.225
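Both income examples can be checked with a few lines of Python:

```python
v = 73600.0

# Min-max normalization to [0.0, 1.0]
minmax = (v - 12000.0) / (98000.0 - 12000.0)
print(round(minmax, 3))  # 0.716

# Z-score normalization
zscore = (v - 54000.0) / 16000.0
print(zscore)  # 1.225
```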
Exercise
• Define normalization.
• What is the value range of min-max normalization? Use min-max normalization to normalize the following group of data: 8, 10, 15, 20.
• Solution:

Marks | Marks after min-max normalization
8     | 0
10    | 0.16
15    | 0.58
20    | 1
Data Mining
• Introduction to KDD
• Data Preprocessing: An Overview
• Data Quality
• Major Tasks in Data Preprocessing
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation and Data Discretization
• Summary
Data Reduction Strategies
Data Cube Aggregation
Dimensionality Reduction: Attribute Subset Selection
• Reduces the data set size by removing irrelevant or redundant attributes (or dimensions)
• The goal of attribute subset selection is to find a minimum set of attributes
• Improves mining speed, since the data set size is reduced
• Mining on a reduced data set also makes the discovered pattern easier to understand
• Redundant attributes
• Duplicate information contained in one or more attributes
• E.g., purchase price of a product and the amount of sales tax paid
• Irrelevant attributes
• Contain no information that is useful for the data mining task at hand
• E.g., students' telephone number is often irrelevant to the task of predicting students' CGPA
Attribute Subset Selection
Heuristic (Greedy) Methods for Attribute Subset Selection
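Typical greedy heuristics are stepwise forward selection, stepwise backward elimination, and their combination. Below is a minimal sketch of forward selection; the score() evaluation function (e.g., cross-validated accuracy of a model on the candidate attribute set) is an assumed parameter, not defined in the slides:

```python
def forward_select(attributes, score, max_attrs=None):
    """Greedy stepwise forward selection: start from the empty set and
    repeatedly add the single attribute that improves score() the most."""
    selected, best = [], float("-inf")
    remaining = list(attributes)
    while remaining and (max_attrs is None or len(selected) < max_attrs):
        gains = {a: score(selected + [a]) for a in remaining}
        a_best = max(gains, key=gains.get)
        if gains[a_best] <= best:   # no single attribute improves the score
            break
        best = gains[a_best]
        selected.append(a_best)
        remaining.remove(a_best)
    return selected

# Usage sketch: score could wrap, e.g., a cross-validated model fit on the
# candidate attribute columns of the data set
```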
Example of Decision Tree Induction
Data Reduction 2: Numerosity Reduction
• Parametric methods
• Assume the data fits some model, estimate model parameters, store only the
parameters, and discard the data (except possible outliers)
• E.g., log-linear models: obtain the value at a point in m-D space as the product of values on appropriate marginal subspaces
• Non-parametric methods
• Do not assume models
• Major families: histograms, clustering, sampling
• Histograms (or frequency histograms) are at least a century old and are widely used
• A histogram is a graphical method for summarizing the distribution of a given attribute, X
• The height of each bar indicates the frequency (i.e., count) of that X value
• The range of values for X is partitioned into disjoint, consecutive subranges
• The subranges, referred to as buckets or bins, are disjoint subsets of the data distribution for X
• The range of a bucket is known as its width
• Typically, the buckets are of equal width
• E.g., a price attribute with a value range of $1 to $200 can be partitioned into subranges 1 to 20, 21 to 40, 41 to 60, and so on
• For each subrange, a bar is drawn whose height represents the total count of items observed within the subrange
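The price example above, sketched with NumPy's histogram; the actual price values are made up:

```python
import numpy as np

# Hypothetical prices in the $1-$200 range
prices = np.array([3, 15, 22, 25, 37, 44, 58, 61, 99, 120, 150, 180, 199])

# Equal-width buckets of width 20: 1-20, 21-40, 41-60, ...
counts, edges = np.histogram(prices, bins=np.arange(1, 202, 20))
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"${lo}-${hi - 1}: {c} item(s)")
```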
Histograms
• As the graph shows, most of the high-frequency bars lie in the first half of the range (the darker portion), which means the image is dark
Data Compression
• String compression
• There are extensive theories and well-tuned algorithms
• Typically lossless
• But only limited manipulation is possible without expansion
• Audio/video, image compression
• Typically lossy compression, with progressive refinement
• Sometimes small fragments of signal can be reconstructed without
reconstructing the whole
• Time sequences are not audio: they are typically short and vary slowly with time
• In lossy compression, only an approximation of the original data can be reconstructed
Clustering
• Partition the data set into clusters, and store the cluster representation only
• Quality of clusters is measured by their diameter (max distance between any two objects in the cluster) or centroid distance (avg. distance of each cluster object from its centroid)
• Can be very effective if the data is clustered, but not if the data is “smeared”
• Can have hierarchical clustering (possibly stored in multi-dimensional index tree structures such as B+-tree, R-tree, quad-tree, etc.)
• There are many choices of clustering definitions and clustering algorithms (further details later)
Types of Sampling
• Simple random sampling without replacement (SRSWOR): each tuple drawn from the raw data is removed from further consideration
• Simple random sampling with replacement (SRSWR): each drawn tuple is recorded and placed back, so it may be drawn again
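Both sampling schemes in one NumPy sketch; the data and sample size are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
raw = np.arange(1, 101)  # raw data: 100 tuples

srswor = rng.choice(raw, size=10, replace=False)  # no tuple drawn twice
srswr = rng.choice(raw, size=10, replace=True)    # tuples may repeat
```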
Sampling
Discretization
Discretization and Concept Hierarchies
• Discretization
• Reduces the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values.
• Concept hierarchies
• Reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior).
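A small pandas sketch of both ideas for an age attribute; the cut points and concept labels are illustrative assumptions:

```python
import pandas as pd

ages = pd.Series([18, 25, 33, 47, 52, 61, 70])

# Discretization: divide the continuous range into intervals, then climb
# the concept hierarchy by replacing values with higher-level labels
labels = pd.cut(ages, bins=[0, 30, 55, 120],
                labels=["young", "middle-aged", "senior"])
print(labels.tolist())  # ['young', 'young', 'middle-aged', ...]
```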
Discretization and Concept Hierarchies: Numerical Data
• Entropy-based discretization
Data Discretization Methods
• Typical methods (all of the following can be applied recursively):
• Binning
• Top-down split, unsupervised
• Histogram analysis
• Top-down split, unsupervised
• Clustering analysis (unsupervised, top-down split or bottom-up merge)
• Decision-tree analysis (supervised, top-down split)
• Correlation (e.g., χ2) analysis (unsupervised, bottom-up merge)
Summary
• Data quality: accuracy, completeness, consistency, timeliness, believability, interpretability
• Data cleaning: e.g., handling missing/noisy values and outliers
• Data integration from multiple sources:
• Entity identification problem
• Remove redundancies
• Detect inconsistencies
• Data reduction
• Dimensionality reduction
• Numerosity reduction
• Data compression
• Data transformation and data discretization
• Normalization
• Concept hierarchy generation