Topics to be covered
• Why pre-process data?
• Mean, Median, Mode, Range & Standard Deviation
• Attribute Types
• Data Summarization
• Data Cleaning
• Data Integration
• Data Transformation
• Data Reduction
Section - 1
Why pre-process data?
Data pre-processing is a data mining technique that involves transforming raw data
(real world data) into an understandable format.
Real-world data is often incomplete, inconsistent, lacking in certain behaviors or
trends, and likely to contain many errors.
Incomplete: Missing attribute values, lack of certain attributes of interest, or containing only
aggregate data.
E.g. Occupation = " "
Noisy: Containing errors or outliers.
E.g. Salary = "abcxy"
Inconsistent: Containing discrepancies in codes or names.
E.g. "Gujarat" & "Gujrat" (common mistakes like spelling, grammar, articles)
Why pre-process data? (Cont..)
No quality data, no quality results.
It looks like Garbage In Garbage Out (GIGO).
Quality decisions must be based on quality data.
Duplicate or missing data may cause incorrect or even misleading statistics.
Data preprocessing prepares raw data for further processing.
Data preparation, cleaning, and transformation make up the majority of the work
(about 90%) in data mining.
Section - 2
Mean (Average)
Mean is the average of a dataset.
The mean is the total of all the values, divided by the number of values.
Formula to find mean: Mean = (sum of all values) / (number of values)
Example
Find the mean for 12, 15, 11, 11, 7, 13 (here the total number of values is 6).
First, find the sum of the data: 12 + 15 + 11 + 11 + 7 + 13 = 69
Then divide by the total number of values: 69 / 6 = 11.5 (Mean)
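We can verify this arithmetic with Python's standard library:

```python
import statistics

data = [12, 15, 11, 11, 7, 13]
print(sum(data))              # 69
print(statistics.mean(data))  # 11.5
```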
Median {Centre Or Middle Value}
The median is the middle number in a list of numbers ordered from lowest to highest.
If the count is odd, then the middle number is the median.
Example
Find the median for 12, 15, 11, 11, 7, 13, 15 (here the total number of values is 7, which is odd).
First, arrange the data in ascending order: 7, 11, 11, 12, 13, 15, 15
Partition the data into two equal halves: 7, 11, 11, | 12 | 13, 15, 15
The middle number, 12, is the median.
Median {Centre Or Middle Value} (Cont..)
If the count is even, then the median is the average (mean) of the middle two numbers.
Example
Find the median for 12, 15, 11, 11, 7, 13 (here the total number of values is 6, which is even).
First, arrange the data in ascending order: 7, 11, 11, 12, 13, 15
Calculate the average (mean) of the two numbers in the middle: 7, 11, | 11, 12 | 13, 15
(11 + 12) / 2 = 11.5 is the median.
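Python's statistics module handles both the odd and even cases:

```python
import statistics

print(statistics.median([12, 15, 11, 11, 7, 13, 15]))  # 12   (odd count: the middle value)
print(statistics.median([12, 15, 11, 11, 7, 13]))      # 11.5 (even count: mean of the middle two)
```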
Mode
The mode is the number that occurs most often within a set of numbers.
Example
• 12, 15, 11, 11, 7, 13 → Mode: 11 (Unimodal)
• 12, 15, 11, 11, 7, 12, 13 → Mode: 11, 12 (Bimodal)
• 12, 12, 15, 11, 11, 7, 13, 7 → Mode: 7, 11, 12 (Trimodal)
• 12, 15, 11, 10, 7, 14, 13 → No Mode
If more than three numbers repeat within a set of numbers, it is called multimodal.
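Python 3.8+ can report every mode at once via statistics.multimode:

```python
import statistics

print(statistics.multimode([12, 15, 11, 11, 7, 13]))         # [11]        (unimodal)
print(statistics.multimode([12, 12, 15, 11, 11, 7, 13, 7]))  # [12, 11, 7] (trimodal)
```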
Range
The range of a set of data is the difference between the largest and the smallest
number in the set.
Example
Find the range for given data 40, 30, 43, 48, 26, 50, 55, 40, 34, 42, 47, 50
First, arrange the data in ascending order.
26, 30, 34, 40, 40, 42, 43, 47, 48, 50, 50, 55
In our example the largest number is 55; subtract the smallest number, 26:
55 – 26 = 29 (Range)
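In Python the range is simply the difference of the extremes:

```python
data = [40, 30, 43, 48, 26, 50, 55, 40, 34, 42, 47, 50]
print(max(data) - min(data))  # 55 - 26 = 29
```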
Standard Deviation (σ)
The standard deviation is a measure of how spread out the values in a dataset are around the mean.
Formula to find standard deviation: σ = √( Σ(x − µ)² / N ), where µ is the mean and N is the number of values.
The owner of the Indian restaurant is interested in how much people spend at the restaurant.
He examines 8 randomly selected receipts for parties and writes down the following data.
44, 50, 38, 96, 42, 47, 40, 39
1. Find the mean (the mean is 49.5 for the given data).
2. Write a table that subtracts the mean from each observed value, square each difference, and total the squared differences.

X     | X − Mean   | Value | (X − Mean)²
44    | 44 − 49.5  | −5.5  | 30.25
50    | 50 − 49.5  | 0.5   | 0.25
38    | 38 − 49.5  | −11.5 | 132.25
96    | 96 − 49.5  | 46.5  | 2162.25
42    | 42 − 49.5  | −7.5  | 56.25
47    | 47 − 49.5  | −2.5  | 6.25
40    | 40 − 49.5  | −9.5  | 90.25
39    | 39 − 49.5  | −10.5 | 110.25
Total |            |       | 2588
Standard Deviation (σ) Cont..
Standard deviation can be thought of as measuring how far the data values lie from the
mean: we take the mean and move one standard deviation in either direction.
The mean for this example is µ = 49.5 and the standard deviation is σ = √(2588 / 8) ≈ 18.
Now, we subtract and add this value: 49.5 − 18 = 31.5 and 49.5 + 18 = 67.5
This means that most of the people probably spend between 31.5 and 67.5.
38, 39, 40, 42, 44, 47, 50, 96
If all data are same then variance & standard deviation is 0 (zero).
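The whole calculation can be reproduced in a few lines of Python:

```python
import math

data = [44, 50, 38, 96, 42, 47, 40, 39]
mu = sum(data) / len(data)             # mean: 49.5
ss = sum((x - mu) ** 2 for x in data)  # sum of squared deviations: 2588.0
sigma = math.sqrt(ss / len(data))      # population standard deviation, about 17.99
print(mu, ss, round(sigma, 2))
```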
Summary
• Mean: the average of a dataset.
• Median: the middle number in a dataset when the data is arranged in numerical (sorted) order.
• Mode: the number that occurs most often within a set of numbers.
• Range: the difference between the largest and the smallest number in the set.
• Standard Deviation: a measure of how spread out the numbers in a dataset are.
Section - 3
What is an Attribute?
The attribute can be defined as a field for storing the data that represents the
characteristics of a data object.
It can also be viewed as a property, characteristics, feature or column of a data
object.
It represents the different features of an object (real world entity) like..
Person → Name, Age, Qualification, Birthdate, etc.
Computer → Brand, Model, Processor, RAM, etc.
Book → Book Name, Author, Price, ISBN, etc.
An attribute set defines an object.
The object is also referred to as a record, an instance, or an entity.
Attribute Types
Attribute types can be divided into two main categories:
1. Quantitative
   1. Discrete
   2. Continuous
2. Qualitative
   1. Nominal
   2. Ordinal
   3. Binary
      1. Symmetric
      2. Asymmetric
1. Quantitative Attribute
Quantitative is an adjective that simply means something that can be measured.
It is a special attribute type used to compare values, i.e., a user-defined value can be
compared against an upper limit and a lower limit.
Example
We can count the number of sheep on a farm or measure the liters of milk produced by a cow.
Consider a query to find all patients with low or high blood glucose levels. In the database, for each
patient a lower value and an upper value for the blood glucose level are stored in the Result class.
To find patients with a low/high blood glucose level, without quantitative attributes you would have to
specify a limit on the Low attribute or the High attribute of the Result class.
While defining the limit you can use Between, Equals, Less than, Less than or equal to, Greater
than, and Greater than or equal to as relational operators.
1. Quantitative Attribute Cont..
1) Discrete Attribute
A discrete attribute has a finite or countably infinite set of values, which may or may not be
represented as integers.
The attributes hair_color, smoker, medical_test, and drink_size each have a finite number of
values, and so are discrete.
CustomerID in a table has a countably infinite set of values because it keeps growing over time.
2) Continuous Attribute
A continuous attribute has real numbers as attribute values.
The attributes temperature, height, and weight are examples of continuous attributes.
Practically, real values can only be measured and represented using a finite number of digits.
Continuous attributes are typically represented as floating-point variables.
2. Qualitative Attribute
Qualitative data deals with characteristics and descriptors that can't be easily
measured, but can be observed subjectively—such as smells, tastes, textures,
attractiveness, and color.
A qualitative attribute is named or described in words; simple arithmetic on its values is not meaningful.
Even when such an attribute is represented with integer or real codes, the numbers carry no quantitative meaning.
Results of a qualitative attribute are often quoted on scales.
Below are the qualitative attribute types:
Nominal
Ordinal
Binary (Symmetric or Asymmetric)
2. Qualitative Attribute Cont..
1) Nominal Attribute
Nominal attributes are named attributes which can be separated into discrete (individual)
categories which do not overlap.
Nominal attribute values are also called distinct values.
Example
hair_color ∈ {black, brown, red, grey}, marital status, occupation.
2. Qualitative Attribute Cont..
2) Ordinal Attribute
With an ordinal attribute, the order of the values is important and significant, but the differences
between the values are not really known.
Example
Rankings: 1st, 2nd, 3rd
Ratings: e.g., star ratings from 1 to 5
We know that a 5-star rating is better than a 2-star or 3-star rating, but we don't know, and cannot
quantify, how much better it is.
3) Binary Attribute
Binary attributes are categorical attributes with only two possible values: (yes or no), (true or
false), (0 or 1).
A symmetric binary attribute is one in which both values are equally valuable (male or female).
The male value here is not more important than the female value.
An asymmetric binary attribute is one in which the two states are not equally important; for example, a
medical test (positive or negative), where a positive result is more significant than a negative one.
Extra Attribute Types
Interval Attribute
An interval attribute comes in the form of numerical values where the difference between points is
meaningful.
Example
Temperature 10°-20°, 30°-50°, 35°-45°
Calendar Dates 15th – 22nd, 10th – 30th
We cannot find a true (absolute) zero value with interval attributes.
Ratio Attribute
A ratio attribute looks like an interval attribute, but it must have a true (absolute) zero value.
It tells us about the order and the exact value between units or data.
Example
Age Group 10-20, 30-50, 35-45 (In years)
Mass 20-30 kg, 10-15 kg
Since it has a true (absolute) zero, it is possible to compute ratios.
Section - 4
Why Data Summarization?
We are living in a digital world where data is transferred in seconds, far faster than
human capability.
In the corporate field, employees work on huge volumes of data derived
from different sources like social networks, media, newspapers, books, cloud media
storage, etc.
This can sometimes make it difficult to summarize the data.
The data volume may also be unexpected: when you retrieve data from relational
sources, you cannot predict how much data will be stored in the database.
As a result, the data becomes more complex and takes time to summarize.
What is Data Summarization?
Summarization is a key data mining concept which involves techniques for finding a
compact description of a dataset.
It is aimed at extracting useful information and general trends from the raw data.
Two methods for data summarization are through tables and graphs.
Tables are a row-and-column representation of the dataset, on which you can apply aggregate functions.
Graphs show the relation between variable quantities, typically two variables, each
measured along one of a pair of axes at right angles.
Section - 5
Data Cleaning
1. Fill in missing values
1. Ignore the tuple
2. Fill missing value manually
3. Fill in the missing value automatically
4. Use a global constant to fill in the missing value
2. Identify outliers and smooth out noisy data
1. Binning Method
2. Regression
3. Clustering
3. Correct inconsistent data
4. Resolve redundancy caused by data integration
1) Fill in missing values
Ignore the tuple (record/row):
• Usually done when class label is missing.
• Example
o The task is to distinguish between two types of emails, "spam" and "non-spam" (ham).
o Spam and non-spam are called class labels.
o If an email arrives whose class label is missing, it is discarded.
Fill missing value manually:
• Use the attribute mean (average) to fill in the missing value, or use the attribute mean
(average) of all samples belonging to the same class.
Fill in the missing value automatically:
• Predict the missing value by using a learning algorithm:
o Consider the attribute with the missing value as a dependent variable and run a learning algorithm
(usually Naive Bayes or Decision tree) to predict the missing value.
Use a global constant to fill in the missing value
• Replace all missing attribute values by the same constant such as a label like
“Unknown”.
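A minimal pandas sketch of three of these strategies, using a hypothetical table:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"occupation": ["engineer", None, "teacher", None],
                   "salary": [52000, np.nan, 48000, 61000]})

dropped = df.dropna()                                    # ignore the tuple: drop rows with missing values
df["salary"] = df["salary"].fillna(df["salary"].mean())  # fill with the attribute mean
df["occupation"] = df["occupation"].fillna("Unknown")    # fill with a global constant
print(df)
```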
2) Identify outliers and smooth out noisy data
There are three data smoothing techniques, as follows:
1. Binning:
Binning methods smooth a sorted data value by consulting its "neighborhood", that is, the values
around it.
2. Regression:
It conforms data values to a function.
Linear regression involves finding the "best" line to fit two attributes (or variables) so that one
attribute can be used to predict the other.
3. Outlier analysis:
Outliers may be detected by clustering, for example, where similar values are organized into
groups or "clusters".
Values that fall outside of the set of clusters may be considered outliers.
1. Binning Method
Binning method is a top-down splitting technique based on a specified number of
bins.
In this method the data is first sorted and then the sorted values are distributed into a
number of buckets or bins.
For example, attribute values can be discretized (separated) by applying equal-width
or equal-frequency binning, and then replacing each value by the bin mean, median
or boundaries.
It can be applied recursively to the resulting partitions to generate concept
hierarchies.
It does not use class information, therefore it is called an unsupervised discretization
technique.
It is used to minimize the effects of small observation errors.
1. Binning Method Cont..
There are basically two types of binning approaches..
1. Equal width (or distance) binning :
The simplest binning approach is to partition the range of the variable into N equal-width
intervals.
The interval width is simply the range [Min, Max] of the variable divided by N:
Width = (Max – Min) / N (N = number of bins)
Example
Data: 5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215
As per the above formula we have Max = 215, Min = 5, Number of bins = 3, so Width = (215 – 5) / 3 = 70.
Bin 1 (from 5 to 75): 5, 10, 11, 13, 15, 35, 50, 55, 72
Bin 2 (from 75 to 145): 92
Bin 3 (from 145 to 215): 204, 215
(See the code sketch after this list.)
2. Equal depth (or frequency) binning :
In equal-frequency binning we divide the range [Min, Max] of the variable into intervals that
contain (approximately) an equal number of points; exactly equal frequency may not be possible
due to repeated values.
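Here is that sketch of both approaches (the helper names are our own, and the equal-depth version assumes the data splits evenly):

```python
def equal_width_bins(data, n):
    lo, hi = min(data), max(data)
    width = (hi - lo) / n
    bins = [[] for _ in range(n)]
    for x in sorted(data):
        i = min(int((x - lo) // width), n - 1)  # clamp the maximum into the last bin
        bins[i].append(x)
    return bins

def equal_depth_bins(data, n):
    s = sorted(data)
    size = len(s) // n  # assumes len(data) divides evenly by n
    return [s[i * size:(i + 1) * size] for i in range(n)]

data = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
print(equal_width_bins(data, 3))  # [[5, 10, 11, 13, 15, 35, 50, 55, 72], [92], [204, 215]]
print(equal_depth_bins(data, 3))  # [[5, 10, 11, 13], [15, 35, 50, 55], [72, 92, 204, 215]]
```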
1. Binning Method Cont..
Bin Operations
1. Smoothing by bin means
In smoothing by bin means, each value in a bin is replaced by the mean value of the bin.
2. Smoothing by bin median
In this method each bin value is replaced by its bin median value.
3. Smoothing by bin boundary
In smoothing by bin boundaries, the minimum and maximum values in a given bin are
identified as the bin boundaries.
Each bin value is then replaced by the closest boundary value.
Binning Method Example – {Bin Means}
Given data: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Step 1: Partition into equal-depth bins [n = 4]:
Bin 1: 4, 8, 9, 15
Bin 2: 21, 21, 24, 25
Bin 3: 26, 28, 29, 34
Step 2: Smoothing by bin means:
Bin means: (4 + 8 + 9 + 15) / 4 = 9, (21 + 21 + 24 + 25) / 4 = 22.75 ≈ 23, (26 + 28 + 29 + 34) / 4 = 29.25 ≈ 29
Bin 1: 9, 9, 9, 9
Bin 2: 23, 23, 23, 23
Bin 3: 29, 29, 29, 29
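A sketch that reproduces the bin-means smoothing above (rounding each bin mean to the nearest integer, as the slide does):

```python
def smooth_by_bin_means(data, n_bins):
    s = sorted(data)
    size = len(s) // n_bins  # assumes an even split into equal-depth bins
    out = []
    for i in range(n_bins):
        b = s[i * size:(i + 1) * size]
        mean = round(sum(b) / len(b))  # 9, 23 (from 22.75), 29 (from 29.25)
        out.extend([mean] * len(b))
    return out

data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(smooth_by_bin_means(data, 3))  # [9, 9, 9, 9, 23, 23, 23, 23, 29, 29, 29, 29]
```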
Binning Method Example – {Bin Boundaries}
Given data: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Step: 1
Partition into equal-depth [n=4]:
Bin 1: 4, 8, 9, 15
Bin 2: 21, 21, 24, 25
Bin 3: 26, 28, 29, 34
Step: 2
• Smoothing by bin boundaries:
Bin 1: 4, 4, 4, 15
Bin 2: 21, 21, 25, 25
Bin 3: 26, 26, 26, 34
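A matching sketch for smoothing by bin boundaries (in this version, ties go to the lower boundary):

```python
def smooth_by_bin_boundaries(data, n_bins):
    s = sorted(data)
    size = len(s) // n_bins  # assumes an even split into equal-depth bins
    out = []
    for i in range(n_bins):
        b = s[i * size:(i + 1) * size]
        lo, hi = b[0], b[-1]
        # replace each value with whichever bin boundary is closer
        out.extend([lo if x - lo <= hi - x else hi for x in b])
    return out

data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(smooth_by_bin_boundaries(data, 3))  # [4, 4, 4, 15, 21, 21, 25, 25, 26, 26, 26, 34]
```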
2. Regression
Data smoothing can also be done by regression, a technique that conforms data
values to a function.
Regression analysis is a way to find trends in data; it mathematically describes the
relationship between the independent variables and the dependent variable.
It can be divided into two categories:
1. Linear regression :
It involves finding the "best" line to fit two attributes (or variables) so that one attribute can be used to
predict the other.
It analyses a single x variable for each dependent y variable, for example: (x1, y1).
2. Multiple linear regression :
An extension of linear regression, where more than two attributes are involved and the data are fit to a
multidimensional surface.
It uses multiple x variables for each dependent variable: ((x1)1, (x2)1, (x3)1, y1).
3. Clustering
Cluster analysis or clustering is the task of grouping a set of objects in such a way
that objects in the same group (called a cluster) are more similar (in some sense) to
each other than to those in other groups (clusters).
Cluster analysis as such is not an automatic task, but an iterative process
of knowledge discovery or interactive multi-objective optimization that involves trial
and error.
It is often necessary to modify data preprocessing and model parameters until the
result achieves the desired properties.
Correct Inconsistent Data
With larger datasets, it can be difficult to find all of the inconsistencies.
Inconsistent data contains discrepancies in codes or names.
We can manually solve common mistakes like spelling, grammar, articles or use
other tools for it.
Resolve Redundancy Caused by Data Integration
Data redundancy occurs in database systems which have a field that is repeated
in two or more tables.
When customer data is duplicated and attached to each product bought, the
redundant data can become inconsistent.
In that case, the entity "customer" might appear with different values.
Database normalization prevents redundancy and makes the best possible usage
of storage.
The proper use of foreign keys can minimize data redundancy and reduce the
chance of destructive anomalies appearing.
Section - 6
Data Integration
Data integration combines data from multiple sources into a coherent store.
Schema integration: e.g., A.cust-id ≡ B.cust-#
Integrate metadata from different sources.
Entity identification problem:
Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton.
Detecting and resolving data value conflicts:
For the same real-world entity, attribute values from different sources may differ.
Possible reasons: different representations, different scales, e.g., metric vs. British units.
Handling Redundancy in Data Integration
Redundant data often occur when multiple databases are integrated.
Object identification: The same attribute or object may have different names in different
databases.
Derivable data: One attribute may be a "derived" attribute in another table, e.g., annual revenue.
Redundant attributes may be detected by correlation analysis and covariance analysis.
Careful integration of the data from multiple sources may help reduce/avoid
redundancies and inconsistencies and improve mining speed and quality.
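As a sketch, correlation analysis for spotting a derivable attribute could look like this with NumPy (the column values are invented for illustration):

```python
import numpy as np

monthly_revenue = np.array([10.0, 12.0, 15.0, 20.0, 22.0])
annual_revenue = monthly_revenue * 12  # a derivable, redundant attribute

r = np.corrcoef(monthly_revenue, annual_revenue)[0, 1]
print(r)  # 1.0 -> perfectly correlated, so one of the two attributes is redundant
```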
Section - 7
Data Transformation
Data transformation is a function that maps the entire set of values of a given attribute to a new set
of replacement values, such that each old value can be identified with one of the new values.
Methods
Smoothing: Remove noise from data
Attribute/feature construction
New attributes constructed from the given ones
Aggregation: Summarization, data cube construction
Normalization: Scaled to fall within a smaller, specified range
Min-max normalization
Z-score normalization
Normalization by decimal scaling
Discretization: Concept hierarchy climbing
1. Min-Max Normalization
Min-max is a technique that helps to normalize the data.
It scales the data between 0 and 1, or within a specified range.
Formula: v' = (v – Min) / (Max – Min) × (NewMax – NewMin) + NewMin
Example
Given data (Age): 16, 20, 30, 40
Min : minimum value = 16
Max : maximum value = 40
v = the respective attribute value; in our example v1 = 16, v2 = 20, v3 = 30, v4 = 40.
NewMax = 1
NewMin = 0
1. Min-Max Normalization Cont..
Example
For Age 16: v' = (16 – 16) / (40 – 16) × (1 – 0) + 0 = 0 / 24 = 0
For Age 20: v' = (20 – 16) / (40 – 16) × (1 – 0) + 0 = 4 / 24 ≈ 0.17
For Age 30: v' = (30 – 16) / (40 – 16) × (1 – 0) + 0 = 14 / 24 ≈ 0.58
For Age 40: v' = (40 – 16) / (40 – 16) × (1 – 0) + 0 = 24 / 24 = 1

Age | After min-max normalization
16  | 0
20  | 0.17
30  | 0.58
40  | 1
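A small Python sketch that reproduces the table above (the function name is our own):

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

ages = [16, 20, 30, 40]
print([round(v, 2) for v in min_max_normalize(ages)])  # [0.0, 0.17, 0.58, 1.0]
```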
2. Decimal Scaling
In this technique we move the decimal point of the values of the attribute.
How far the decimal point is moved depends on the maximum absolute value among all values of the attribute.
A value v of attribute A can be normalized by the following formula:
v' = v / 10^j
where j is the smallest integer such that Max(|v'|) < 1.
Example
CGPA | Formula | After decimal scaling
2    | 2 / 10  | 0.2
3    | 3 / 10  | 0.3
We check the maximum value among the values of our attribute CGPA.
The maximum value is 3, so we can convert the values into decimals by dividing by 10. Why 10?
We count the total number of digits in the maximum value, put a 1, and then append that many zeros.
Here the maximum value is 3, which has only 1 digit, so we put one zero after 1, giving 10.
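A small sketch of decimal scaling (the helper is our own, and it also handles maxima that are exact powers of ten):

```python
import math

def decimal_scale(values):
    max_abs = max(abs(v) for v in values)
    j = math.ceil(math.log10(max_abs)) if max_abs > 0 else 0
    if max_abs / 10 ** j >= 1:  # an exact power of ten needs one more shift so that |v'| < 1
        j += 1
    return [v / 10 ** j for v in values]

print(decimal_scale([2, 3]))      # [0.2, 0.3]
print(decimal_scale([201, 350]))  # [0.201, 0.35]
```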
3. Z-Score Normalization
It is also called zero-mean normalization.
The essence of this technique is to transform the values to a common scale where the
average equals zero and the standard deviation is one.
To find z-score values:
z = (v – μ) / σ
where μ is the mean and σ is the standard deviation.
Example
Let μ = 54,000 and σ = 16,000.
Find the z-score for 73,600:
z = (73,600 – 54,000) / 16,000 = 1.225
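As a quick check in Python:

```python
def z_score(v, mu, sigma):
    return (v - mu) / sigma

print(z_score(73_600, mu=54_000, sigma=16_000))  # 1.225
```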
Section - 8
Data Reduction
Why Data Reduction?
A database/data warehouse may store terabytes of data.
Complex data analysis may take a very long time to run on the complete data set.
What is Data Reduction?
Data reduction process reduces the size of data and makes it suitable and feasible for analysis.
In the reduction process, integrity of the data must be preserved and data volume is reduced.
There are many techniques that can be used for data reduction, such as:
1. Dimensionality reduction
2. Numerosity reduction
3. Data compression
1. Dimensionality Reduction
Dimensionality reduction, or dimension reduction, is the transformation of data
from a high-dimensional space into a low-dimensional space so that the low-
dimensional representation retains some meaningful properties of the original data,
ideally close to its intrinsic dimension.
The number of input variables or features for a dataset is referred to as its
dimensionality.
Dimensionality reduction refers to techniques that reduce the number of input
variables in a dataset.
Example
Dimensional reduction can be discussed through a simple e-mail classification problem, where
we need to classify whether the e-mail is spam or not.
This can involve a large number of features, such as whether or not the e-mail has a generic title,
the content of the e-mail, whether the e-mail uses a template, etc.
1. Dimensionality Reduction Cont..
A 3-D classification problem
can be hard to visualize,
whereas a 2-D one can be
mapped to a simple 2
dimensional space, and a 1-D
problem to a simple line.
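An illustrative sketch of such a projection, assuming scikit-learn is available (the data here is random, just to show the shapes):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 3)                   # 100 samples with 3 features (3-D)
X_2d = PCA(n_components=2).fit_transform(X)  # project onto the 2 main directions of variance
print(X.shape, X_2d.shape)                   # (100, 3) (100, 2)
```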
2. Numerosity Reduction
Numerosity reduction is a data reduction technique that replaces the
original data with a smaller form of data representation.
There are two groups of techniques for numerosity reduction: parametric and
non-parametric methods.
Parametric Methods
For parametric methods, data is represented using some model.
The model is used to estimate the data, so that only parameters of data are required to
be stored, instead of actual data.
Regression and Log-Linear methods are used for creating such models.
Non-Parametric Methods
These methods store reduced representations of the data; they
include histograms, clustering, sampling, and data cube aggregation.
Regression
Regression can be a simple linear regression or multiple linear regression.
When there is only single independent attribute, such regression model is
called simple linear regression and if there are multiple independent attributes,
then such regression models are called multiple linear regression.
In linear regression, the data are modeled to fit a straight line.
For example, a random variable y can be modeled as a linear function of
another random variable x with the equation y = ax + b,
where a and b (the regression coefficients) specify the slope and y-intercept of
the line, respectively.
In multiple linear regression, y will be modeled as a linear function of two or
more predictor (independent) variables.
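A brief sketch of fitting y = ax + b by least squares with NumPy (the sample points are invented for illustration):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

a, b = np.polyfit(x, y, 1)       # slope a and intercept b of the best-fit line
print(round(a, 2), round(b, 2))  # roughly 1.99 and 0.09

# only a and b need to be stored; predictions reconstruct the data
y_est = a * x + b
```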
Log-Linear Model
Log-linear model can be used to estimate the probability of each data point in
a multidimensional space for a set of discretized attributes, based on a smaller
subset of dimensional combinations.
This allows a higher-dimensional data space to be constructed from lower-
dimensional attributes.
Regression and log-linear model can both be used on sparse data (most of
the elements are zero), although their application may be limited.
Non-Parametric Methods
Histograms
Histogram is the data representation in terms of frequency.
It uses binning to approximate data distribution and is a popular form of data reduction.
Clustering
Clustering divides the data into groups/clusters; it partitions the whole dataset into different
clusters.
In data reduction, the cluster representation of the data is used to replace the actual
data. Clustering also helps to detect outliers in the data.
Sampling
Sampling can be used for data reduction because it allows a large data set to be
represented by a much smaller random data sample (or subset).
Data Cube Aggregation
Data cube aggregation involves moving the data from a detailed level to a smaller number
of dimensions.
The resulting data set is smaller in volume, without loss of information
necessary for the analysis task.
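A small sketch of two of the methods above, sampling and histograms, on synthetic data:

```python
import random
from collections import Counter

random.seed(0)
data = [random.gauss(50, 15) for _ in range(100_000)]  # a large synthetic data set

sample = random.sample(data, k=1_000)                  # represent the data by a small random subset

# histogram: approximate the distribution with equal-width bins of width 10
hist = Counter(int(x // 10) * 10 for x in sample)
print(sorted(hist.items()))                            # (bin start, frequency) pairs
```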
3. Data Compression
Data Compression is a reduction in the number of bits needed to represent data.
Compressing data can save storage capacity, speed up file transfer, and decrease
costs for storage hardware and network bandwidth.
Compressing data can be a lossless or lossy process.
Lossless compression
It enables the restoration of a file to its original state, without the loss of a single bit of data, when the file
is uncompressed.
Lossless compression is the typical approach with executables, as well as text and spreadsheet files,
where the loss of words or numbers would change the information.
Lossy compression
It permanently eliminates bits of data that are redundant, unimportant or imperceptible.
Lossy compression is useful with graphics, audio, video and images, where the removal of some data
bits has little or no discernible effect on the representation of the content.
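A quick lossless-compression check with Python's zlib:

```python
import zlib

text = b"data mining " * 100            # highly redundant data
packed = zlib.compress(text)

print(len(text), "->", len(packed))     # e.g. 1200 -> roughly 30 bytes
assert zlib.decompress(packed) == text  # lossless: the original is restored bit for bit
```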