Topics to be covered
• Why pre-process data?
• Mean, Median, Mode, Range & Standard Deviation
• Attribute Types
• Data Summarization
• Data Cleaning
• Data Integration
• Data Transformation
• Data Reduction
Section - 1
Why pre-process data?
Data pre-processing is a data mining technique that involves transforming raw data
(real world data) into an understandable format.
Real-world data is often incomplete, inconsistent, lacking in certain behaviors or
trends, and likely to contain many errors.
Incomplete: Missing attribute values, lack of certain attributes of interest, or containing only
aggregate data.
E.g. Occupation = " "
Noisy: Containing errors or outliers.
E.g. Salary = "abcxy"
Inconsistent: Containing discrepancies in codes or names.
E.g. "Gujarat" & "Gujrat" (common mistakes like spelling, grammar, articles)
Why pre-process data? (Cont..)
No quality data, no quality results.
It looks like Garbage In Garbage Out (GIGO).
Quality decisions must be based on quality data.
Duplicate or missing data may cause incorrect or even misleading statistics.
Data preprocessing prepares raw data for further processing.
Data preparation, cleaning, and transformation make up the majority of the work
(about 90%) in data mining.
Section - 2
Mean (Average)
Mean is the average of a dataset.
The mean is the total of all the values, divided by the number of values.
Formula to find mean: Mean = (sum of all values) / (number of values)
Example
Find the mean for 12, 15, 11, 11, 7, 13 (here the total number of values is 6).
First, find the sum of the data: 12 + 15 + 11 + 11 + 7 + 13 = 69
Then divide by the total number of values: 69 / 6 = 11.5 (Mean)
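We can verify this arithmetic with Python's standard library:

```python
import statistics

data = [12, 15, 11, 11, 7, 13]
print(sum(data))              # 69
print(statistics.mean(data))  # 11.5
```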
Median {Centre Or Middle Value}
The median is the middle number in a list of numbers ordered from lowest to highest.
If the count is odd, then the middle number is the median.
Example
Find the median for 12, 15, 11, 11, 7, 13, 15 (here the total number of values is 7, which is odd).
First, arrange the data in ascending order: 7, 11, 11, 12, 13, 15, 15
Partition the data into two equal halves: 7, 11, 11, | 12 | 13, 15, 15
The middle number, 12, is the median.
Median {Centre Or Middle Value} (Cont..)
If the count is even, then the median is the average (mean) of the middle two numbers.
Example
Find the median for 12, 15, 11, 11, 7, 13 (here the total number of values is 6, which is even).
First, arrange the data in ascending order: 7, 11, 11, 12, 13, 15
Calculate the average (mean) of the two numbers in the middle: 7, 11, | 11, 12 | 13, 15
(11 + 12) / 2 = 11.5 is the median.
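Python's statistics module handles both the odd and even cases:

```python
import statistics

print(statistics.median([12, 15, 11, 11, 7, 13, 15]))  # 12   (odd count: the middle value)
print(statistics.median([12, 15, 11, 11, 7, 13]))      # 11.5 (even count: mean of the middle two)
```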
Mode
The mode is the number that occurs most often within a set of numbers.
Example
• 12, 15, 11, 11, 7, 13 → Mode: 11 (Unimodal)
• 12, 15, 11, 11, 7, 12, 13 → Mode: 11, 12 (Bimodal)
• 12, 12, 15, 11, 11, 7, 13, 7 → Mode: 7, 11, 12 (Trimodal)
• 12, 15, 11, 10, 7, 14, 13 → No Mode
If more than three numbers repeat within a set of numbers, it is called multimodal.
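Python 3.8+ can report every mode at once via statistics.multimode:

```python
import statistics

print(statistics.multimode([12, 15, 11, 11, 7, 13]))         # [11]        (unimodal)
print(statistics.multimode([12, 12, 15, 11, 11, 7, 13, 7]))  # [12, 11, 7] (trimodal)
```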
Range
The range of a set of data is the difference between the largest and the smallest
number in the set.
Example
Find the range for given data 40, 30, 43, 48, 26, 50, 55, 40, 34, 42, 47, 50
First, arrange the data in ascending order.
26, 30, 34, 40, 40, 42, 43, 47, 48, 50, 50, 55
In our example the largest number is 55; subtract the smallest number, 26:
55 – 26 = 29 (Range)
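In Python the range is simply the difference of the extremes:

```python
data = [40, 30, 43, 48, 26, 50, 55, 40, 34, 42, 47, 50]
print(max(data) - min(data))  # 55 - 26 = 29
```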
Standard Deviation (σ)
The standard deviation is a measure of how spread out the values in a dataset are around the mean.
Formula to find standard deviation: σ = √( Σ(x − µ)² / N ), where µ is the mean and N is the number of values.
The owner of the Indian restaurant is interested in how much people spend at the restaurant.
He examines 8 randomly selected receipts for parties and writes down the following data.
44, 50, 38, 96, 42, 47, 40, 39
1. Find the mean (the mean is 49.5 for the given data).
2. Write a table that subtracts the mean from each observed value, square each difference, and total the squared differences.

X     | X − Mean   | Value | (X − Mean)²
44    | 44 − 49.5  | −5.5  | 30.25
50    | 50 − 49.5  | 0.5   | 0.25
38    | 38 − 49.5  | −11.5 | 132.25
96    | 96 − 49.5  | 46.5  | 2162.25
42    | 42 − 49.5  | −7.5  | 56.25
47    | 47 − 49.5  | −2.5  | 6.25
40    | 40 − 49.5  | −9.5  | 90.25
39    | 39 − 49.5  | −10.5 | 110.25
Total |            |       | 2588
Standard Deviation (σ) Cont..
Standard deviation can be thought of as measuring how far the data values lie from the
mean: we take the mean and move one standard deviation in either direction.
The mean for this example is µ = 49.5 and the standard deviation is σ = √(2588 / 8) ≈ 18.
Now, we subtract and add this value: 49.5 − 18 = 31.5 and 49.5 + 18 = 67.5
This means that most of the people probably spend between 31.5 and 67.5.
38, 39, 40, 42, 44, 47, 50, 96
If all data are same then variance & standard deviation is 0 (zero).
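The whole calculation can be reproduced in a few lines of Python:

```python
import math

data = [44, 50, 38, 96, 42, 47, 40, 39]
mu = sum(data) / len(data)             # mean: 49.5
ss = sum((x - mu) ** 2 for x in data)  # sum of squared deviations: 2588.0
sigma = math.sqrt(ss / len(data))      # population standard deviation, about 17.99
print(mu, ss, round(sigma, 2))
```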
Summary
• Mean: the average of a dataset.
• Median: the middle number in a dataset when the data is arranged in numerical (sorted) order.
• Mode: the number that occurs most often within a set of numbers.
• Range: the difference between the largest and the smallest number in the set.
• Standard Deviation: a measure of how spread out the numbers in a dataset are.
Section - 3
What is an Attribute?
The attribute can be defined as a field for storing the data that represents the
characteristics of a data object.
It can also be viewed as a property, characteristics, feature or column of a data
object.
It represents the different features of an object (real world entity) like..
Person → Name, Age, Qualification, Birthdate, etc.
Computer → Brand, Model, Processor, RAM, etc.
Book → Book Name, Author, Price, ISBN, etc.
An attribute set defines an object.
The object is also referred to as a record, an instance, or an entity.
Attribute Types
Attribute types can be divided into two main categories:
1. Quantitative
   1. Discrete
   2. Continuous
2. Qualitative
   1. Nominal
   2. Ordinal
   3. Binary
      1. Symmetric
      2. Asymmetric
1. Quantitative Attribute
Quantitative is an adjective that simply means something that can be measured.
It is a special attribute type used to compare values, i.e., a user-defined value can be
compared against an upper limit and a lower limit.
Example
We can count the number of sheep on a farm or measure the liters of milk produced by a cow.
Consider a query to find all patients with low or high blood glucose levels. In the database, for each
patient a lower value and an upper value for the blood glucose level are stored in the Result class.
To find patients with a low/high blood glucose level, without quantitative attributes you would have to
specify a limit on the Low attribute or the High attribute of the Result class.
While defining the limit you can use Between, Equals, Less than, Less than or equal to, Greater
than, and Greater than or equal to as relational operators.
1. Quantitative Attribute Cont..
1) Discrete Attribute
A discrete attribute has a finite or countably infinite set of values, which may or may not be
represented as integers.
The attributes hair_color, smoker, medical_test, and drink_size each have a finite number of
values, and so are discrete.
CustomerID in a table has a countably infinite set of values because it keeps growing over time.
2) Continuous Attribute
A continuous attribute has real numbers as attribute values.
The attributes temperature, height, and weight are examples of continuous attributes.
Practically, real values can only be measured and represented using a finite number of digits.
Continuous attributes are typically represented as floating-point variables.
2. Qualitative Attribute
Qualitative data deals with characteristics and descriptors that can't be easily
measured, but can be observed subjectively—such as smells, tastes, textures,
attractiveness, and color.
A qualitative attribute is named or described in words; simple arithmetic on its values is not meaningful.
Even when such an attribute is represented with integer or real codes, the numbers carry no quantitative meaning.
Results of a qualitative attribute are often quoted on scales.
Below are the qualitative attribute types:
Nominal
Ordinal
Binary (Symmetric or Asymmetric)
2. Qualitative Attribute Cont..
1) Nominal Attribute
Nominal attributes are named attributes which can be separated into discrete (individual)
categories which do not overlap.
Nominal attribute values are also called distinct values.
Example
hair_color ∈ {black, brown, red, grey}, marital status, occupation.
2. Qualitative Attribute Cont..
2) Ordinal Attribute
With an ordinal attribute, the order of the values is important and significant, but the differences
between the values are not really known.
Example
Rankings: 1st, 2nd, 3rd
Ratings: e.g., star ratings from 1 to 5
We know that a 5-star rating is better than a 2-star or 3-star rating, but we don't know, and cannot
quantify, how much better it is.
3) Binary Attribute
Binary attributes are categorical attributes with only two possible values: (yes or no), (true or
false), (0 or 1).
A symmetric binary attribute is one in which both values are equally valuable (male or female).
The male value here is not more important than the female value.
An asymmetric binary attribute is one in which the two states are not equally important; for example, a
medical test (positive or negative), where a positive result is more significant than a negative one.
Extra Attribute Types
Interval Attribute
An interval attribute comes in the form of numerical values where the difference between points is
meaningful.
Example
Temperature 10°-20°, 30°-50°, 35°-45°
Calendar Dates 15th – 22nd, 10th – 30th
We cannot find a true (absolute) zero value with interval attributes.
Ratio Attribute
A ratio attribute looks like an interval attribute, but it must have a true (absolute) zero value.
It tells us about the order and the exact value between units or data.
Example
Age Group 10-20, 30-50, 35-45 (In years)
Mass 20-30 kg, 10-15 kg
Since it has a true (absolute) zero, it is possible to compute ratios.
Section - 4
Why Data Summarization?
We are living in a digital world where data is transferred in seconds, far faster than
human capability.
In the corporate field, employees work on huge volumes of data derived
from different sources like social networks, media, newspapers, books, cloud media
storage, etc.
This can sometimes make it difficult to summarize the data.
The data volume may also be unexpected: when you retrieve data from relational
sources, you cannot predict how much data will be stored in the database.
As a result, the data becomes more complex and takes time to summarize.
What is Data Summarization?
Summarization is a key data mining concept which involves techniques for finding a
compact description of a dataset.
It is aimed at extracting useful information and general trends from the raw data.
Two methods for data summarization are through tables and graphs.
Tables are a row-and-column representation of the dataset, on which you can apply aggregate functions.
Graphs show the relation between variable quantities, typically two variables, each
measured along one of a pair of axes at right angles.
Section - 5
Data Cleaning
1. Fill in missing values
1. Ignore the tuple
2. Fill missing value manually
3. Fill in the missing value automatically
4. Use a global constant to fill in the missing value
2. Identify outliers and smooth out noisy data
1. Binning Method
2. Regression
3. Clustering
3. Correct inconsistent data
4. Resolve redundancy caused by data integration
1) Fill in missing values
Ignore the tuple (record/row):
• Usually done when class label is missing.
• Example
o The task is to distinguish between two types of emails, "spam" and "non-spam" (ham).
o Spam and non-spam are called class labels.
o If an email arrives whose class label is missing, it is discarded.
Fill missing value manually:
• Use the attribute mean (average) to fill in the missing value, or use the attribute mean
(average) of all samples belonging to the same class.
Fill in the missing value automatically:
• Predict the missing value by using a learning algorithm:
o Consider the attribute with the missing value as a dependent variable and run a learning algorithm
(usually Naive Bayes or Decision tree) to predict the missing value.
Use a global constant to fill in the missing value
• Replace all missing attribute values by the same constant such as a label like
“Unknown”.
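A minimal pandas sketch of three of these strategies, using a hypothetical table:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"occupation": ["engineer", None, "teacher", None],
                   "salary": [52000, np.nan, 48000, 61000]})

dropped = df.dropna()                                    # ignore the tuple: drop rows with missing values
df["salary"] = df["salary"].fillna(df["salary"].mean())  # fill with the attribute mean
df["occupation"] = df["occupation"].fillna("Unknown")    # fill with a global constant
print(df)
```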
2) Identify outliers and smooth out noisy data
There are three data smoothing techniques, as follows:
1. Binning:
Binning methods smooth a sorted data value by consulting its "neighborhood", that is, the values
around it.
2. Regression:
It conforms data values to a function.
Linear regression involves finding the "best" line to fit two attributes (or variables) so that one
attribute can be used to predict the other.
3. Outlier analysis:
Outliers may be detected by clustering, for example, where similar values are organized into
groups or "clusters".
Values that fall outside of the set of clusters may be considered outliers.
1. Binning Method
Binning method is a top-down splitting technique based on a specified number of
bins.
In this method the data is first sorted and then the sorted values are distributed into a
number of buckets or bins.
For example, attribute values can be discretized (separated) by applying equal-width
or equal-frequency binning, and then replacing each value by the bin mean, median
or boundaries.
It can be applied recursively to the resulting partitions to generate concept
hierarchies.
It does not use class information, therefore it is called an unsupervised discretization
technique.
It is used to minimize the effects of small observation errors.
1. Binning Method Cont..
There are basically two types of binning approaches..
1. Equal width (or distance) binning :
The simplest binning approach is to partition the range of the variable into N equal-width
intervals.
The interval width is simply the range [Min, Max] of the variable divided by N:
Width = (Max – Min) / N (N = number of bins)
Example
Data: 5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215
As per the above formula we have Max = 215, Min = 5, Number of bins = 3, so Width = (215 – 5) / 3 = 70.
Bin 1 (from 5 to 75): 5, 10, 11, 13, 15, 35, 50, 55, 72
Bin 2 (from 75 to 145): 92
Bin 3 (from 145 to 215): 204, 215
(See the code sketch after this list.)
2. Equal depth (or frequency) binning :
In equal-frequency binning we divide the range [Min, Max] of the variable into intervals that
contain (approximately) an equal number of points; exactly equal frequency may not be possible
due to repeated values.
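Here is that sketch of both approaches (the helper names are our own, and the equal-depth version assumes the data splits evenly):

```python
def equal_width_bins(data, n):
    lo, hi = min(data), max(data)
    width = (hi - lo) / n
    bins = [[] for _ in range(n)]
    for x in sorted(data):
        i = min(int((x - lo) // width), n - 1)  # clamp the maximum into the last bin
        bins[i].append(x)
    return bins

def equal_depth_bins(data, n):
    s = sorted(data)
    size = len(s) // n  # assumes len(data) divides evenly by n
    return [s[i * size:(i + 1) * size] for i in range(n)]

data = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
print(equal_width_bins(data, 3))  # [[5, 10, 11, 13, 15, 35, 50, 55, 72], [92], [204, 215]]
print(equal_depth_bins(data, 3))  # [[5, 10, 11, 13], [15, 35, 50, 55], [72, 92, 204, 215]]
```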
1. Binning Method Cont..
Bin Operations
1. Smoothing by bin means
In smoothing by bin means, each value in a bin is replaced by the mean value of the bin.
2. Smoothing by bin median
In this method each bin value is replaced by its bin median value.
3. Smoothing by bin boundary
In smoothing by bin boundaries, the minimum and maximum values in a given bin are
identified as the bin boundaries.
Each bin value is then replaced by the closest boundary value.
Binning Method Example – {Bin Means}
Given data: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Step 1: Partition into equal-depth bins [n = 4]:
Bin 1: 4, 8, 9, 15
Bin 2: 21, 21, 24, 25
Bin 3: 26, 28, 29, 34
Step 2: Smoothing by bin means:
Bin means: (4 + 8 + 9 + 15) / 4 = 9, (21 + 21 + 24 + 25) / 4 = 22.75 ≈ 23, (26 + 28 + 29 + 34) / 4 = 29.25 ≈ 29
Bin 1: 9, 9, 9, 9
Bin 2: 23, 23, 23, 23
Bin 3: 29, 29, 29, 29
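A sketch that reproduces the bin-means smoothing above (rounding each bin mean to the nearest integer, as the slide does):

```python
def smooth_by_bin_means(data, n_bins):
    s = sorted(data)
    size = len(s) // n_bins  # assumes an even split into equal-depth bins
    out = []
    for i in range(n_bins):
        b = s[i * size:(i + 1) * size]
        mean = round(sum(b) / len(b))  # 9, 23 (from 22.75), 29 (from 29.25)
        out.extend([mean] * len(b))
    return out

data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(smooth_by_bin_means(data, 3))  # [9, 9, 9, 9, 23, 23, 23, 23, 29, 29, 29, 29]
```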
Binning Method Example – {Bin Boundaries}
Given data: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Step: 1
Partition into equal-depth [n=4]:
Bin 1: 4, 8, 9, 15
Bin 2: 21, 21, 24, 25
Bin 3: 26, 28, 29, 34
Step: 2
• Smoothing by bin boundaries:
Bin 1: 4, 4, 4, 15
Bin 2: 21, 21, 25, 25
Bin 3: 26, 26, 26, 34
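A matching sketch for smoothing by bin boundaries (in this version, ties go to the lower boundary):

```python
def smooth_by_bin_boundaries(data, n_bins):
    s = sorted(data)
    size = len(s) // n_bins  # assumes an even split into equal-depth bins
    out = []
    for i in range(n_bins):
        b = s[i * size:(i + 1) * size]
        lo, hi = b[0], b[-1]
        # replace each value with whichever bin boundary is closer
        out.extend([lo if x - lo <= hi - x else hi for x in b])
    return out

data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(smooth_by_bin_boundaries(data, 3))  # [4, 4, 4, 15, 21, 21, 25, 25, 26, 26, 26, 34]
```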
2. Regression
Data smoothing can also be done by regression, a technique that conforms data
values to a function.
Regression analysis is a way to find trends in data; it mathematically describes the
relationship between the independent variables and the dependent variable.
It can be divided into two categories:
1. Linear regression :
It involves finding the "best" line to fit two attributes (or variables) so that one attribute can be used to
predict the other.
It analyses a single x variable for each dependent y variable, for example: (x1, y1).
2. Multiple linear regression :
An extension of linear regression, where more than two attributes are involved and the data are fit to a
multidimensional surface.
It uses multiple x variables for each dependent variable: ((x1)1, (x2)1, (x3)1, y1).
3. Clustering
Cluster analysis or clustering is the task of grouping a set of objects in such a way
that objects in the same group (called a cluster) are more similar (in some sense) to
each other than to those in other groups (clusters).
Cluster analysis as such is not an automatic task, but an iterative process
of knowledge discovery or interactive multi-objective optimization that involves trial
and error.
It is often necessary to modify data preprocessing and model parameters until the
result achieves the desired properties.
Correct Inconsistent Data
With larger datasets, it can be difficult to find all of the inconsistencies.
Inconsistent data contains discrepancies in codes or names.
We can manually solve common mistakes like spelling, grammar, articles or use
other tools for it.
Resolve Redundancy Caused by Data Integration
Data redundancy occurs in database systems which have a field that is repeated
in two or more tables.
When customer data is duplicated and attached to each product bought, the
redundant data can become inconsistent.
In that case, the entity "customer" might appear with different values.
Database normalization prevents redundancy and makes the best possible usage
of storage.
The proper use of foreign keys can minimize data redundancy and reduce the
chance of destructive anomalies appearing.
Section - 6
Data Integration
Data integration combines data from multiple sources into a coherent store.
Schema integration: e.g., A.cust-id ≡ B.cust-#
Integrate metadata from different sources.
Entity identification problem:
Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton.
Detecting and resolving data value conflicts:
For the same real-world entity, attribute values from different sources may differ.
Possible reasons: different representations, different scales, e.g., metric vs. British units.
Handling Redundancy in Data Integration
Redundant data often occur when multiple databases are integrated.
Object identification: The same attribute or object may have different names in different
databases.
Derivable data: One attribute may be a "derived" attribute in another table, e.g., annual revenue.
Redundant attributes may be detected by correlation analysis and covariance analysis.
Careful integration of the data from multiple sources may help reduce/avoid
redundancies and inconsistencies and improve mining speed and quality.
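As a sketch, correlation analysis for spotting a derivable attribute could look like this with NumPy (the column values are invented for illustration):

```python
import numpy as np

monthly_revenue = np.array([10.0, 12.0, 15.0, 20.0, 22.0])
annual_revenue = monthly_revenue * 12  # a derivable, redundant attribute

r = np.corrcoef(monthly_revenue, annual_revenue)[0, 1]
print(r)  # 1.0 -> perfectly correlated, so one of the two attributes is redundant
```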
Section - 7
Data Transformation
Data transformation is a function that maps the entire set of values of a given attribute to a new set
of replacement values, such that each old value can be identified with one of the new values.
Methods
Smoothing: Remove noise from data
Attribute/feature construction
New attributes constructed from the given ones
Aggregation: Summarization, data cube construction
Normalization: Scaled to fall within a smaller, specified range
Min-max normalization
Z-score normalization
Normalization by decimal scaling
Discretization: Concept hierarchy climbing
1. Min-Max Normalization
Min-max is a technique that helps to normalize the data.
It scales the data between 0 and 1, or within a specified range.
Formula: v' = (v – Min) / (Max – Min) × (NewMax – NewMin) + NewMin
Example
Given data (Age): 16, 20, 30, 40
Min : minimum value = 16
Max : maximum value = 40
v = the respective attribute value; in our example v1 = 16, v2 = 20, v3 = 30, v4 = 40.
NewMax = 1
NewMin = 0
1. Min-Max Normalization Cont..
Example
For Age 16: v' = (16 – 16) / (40 – 16) × (1 – 0) + 0 = 0 / 24 = 0
For Age 20: v' = (20 – 16) / (40 – 16) × (1 – 0) + 0 = 4 / 24 ≈ 0.17
For Age 30: v' = (30 – 16) / (40 – 16) × (1 – 0) + 0 = 14 / 24 ≈ 0.58
For Age 40: v' = (40 – 16) / (40 – 16) × (1 – 0) + 0 = 24 / 24 = 1

Age | After min-max normalization
16  | 0
20  | 0.17
30  | 0.58
40  | 1
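A small Python sketch that reproduces the table above (the function name is our own):

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

ages = [16, 20, 30, 40]
print([round(v, 2) for v in min_max_normalize(ages)])  # [0.0, 0.17, 0.58, 1.0]
```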
2. Decimal Scaling
In this technique we move the decimal point of the values of the attribute.
How far the decimal point is moved depends on the maximum absolute value among all values of the attribute.
A value v of attribute A can be normalized by the following formula:
v' = v / 10^j
where j is the smallest integer such that Max(|v'|) < 1.
Example
CGPA | Formula | After decimal scaling
2    | 2 / 10  | 0.2
3    | 3 / 10  | 0.3
We check the maximum value among the values of our attribute CGPA.
The maximum value is 3, so we can convert the values into decimals by dividing by 10. Why 10?
We count the total number of digits in the maximum value, put a 1, and then append that many zeros.
Here the maximum value is 3, which has only 1 digit, so we put one zero after 1, giving 10.
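A small sketch of decimal scaling (the helper is our own, and it also handles maxima that are exact powers of ten):

```python
import math

def decimal_scale(values):
    max_abs = max(abs(v) for v in values)
    j = math.ceil(math.log10(max_abs)) if max_abs > 0 else 0
    if max_abs / 10 ** j >= 1:  # an exact power of ten needs one more shift so that |v'| < 1
        j += 1
    return [v / 10 ** j for v in values]

print(decimal_scale([2, 3]))      # [0.2, 0.3]
print(decimal_scale([201, 350]))  # [0.201, 0.35]
```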
3. Z-Score Normalization
It is also called zero-mean normalization.
The essence of this technique is to transform the values to a common scale where the
average equals zero and the standard deviation is one.
To find z-score values:
z = (v – μ) / σ
where μ is the mean and σ is the standard deviation.
Example
Let μ = 54,000 and σ = 16,000.
Find the z-score for 73,600:
z = (73,600 – 54,000) / 16,000 = 1.225
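As a quick check in Python:

```python
def z_score(v, mu, sigma):
    return (v - mu) / sigma

print(z_score(73_600, mu=54_000, sigma=16_000))  # 1.225
```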
Section - 8
Data Reduction
Why Data Reduction?
A database/data warehouse may store terabytes of data.
Complex data analysis may take a very long time to run on the complete data set.
What is Data Reduction?
Data reduction process reduces the size of data and makes it suitable and feasible for analysis.
In the reduction process, integrity of the data must be preserved and data volume is reduced.
There are many techniques that can be used for data reduction, such as:
1. Dimensionality reduction
2. Numerosity reduction
3. Data compression
1. Dimensionality Reduction
Dimensionality reduction, or dimension reduction, is the transformation of data
from a high-dimensional space into a low-dimensional space so that the low-
dimensional representation retains some meaningful properties of the original data,
ideally close to its intrinsic dimension.
The number of input variables or features for a dataset is referred to as its
dimensionality.
Dimensionality reduction refers to techniques that reduce the number of input
variables in a dataset.
Example
Dimensional reduction can be discussed through a simple e-mail classification problem, where
we need to classify whether the e-mail is spam or not.
This can involve a large number of features, such as whether or not the e-mail has a generic title,
the content of the e-mail, whether the e-mail uses a template, etc.
1. Dimensionality Reduction Cont..
A 3-D classification problem
can be hard to visualize,
whereas a 2-D one can be
mapped to a simple 2
dimensional space, and a 1-D
problem to a simple line.
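An illustrative sketch of such a projection, assuming scikit-learn is available (the data here is random, just to show the shapes):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 3)                   # 100 samples with 3 features (3-D)
X_2d = PCA(n_components=2).fit_transform(X)  # project onto the 2 main directions of variance
print(X.shape, X_2d.shape)                   # (100, 3) (100, 2)
```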
2. Numerosity Reduction
Numerosity reduction is a data reduction technique that replaces the
original data with a smaller form of data representation.
There are two groups of techniques for numerosity reduction: parametric and
non-parametric methods.
Parametric Methods
For parametric methods, data is represented using some model.
The model is used to estimate the data, so that only parameters of data are required to
be stored, instead of actual data.
Regression and Log-Linear methods are used for creating such models.
Non-Parametric Methods
These methods store reduced representations of the data; they
include histograms, clustering, sampling, and data cube aggregation.
Regression
Regression can be a simple linear regression or multiple linear regression.
When there is only single independent attribute, such regression model is
called simple linear regression and if there are multiple independent attributes,
then such regression models are called multiple linear regression.
In linear regression, the data are modeled to fit a straight line.
For example, a random variable y can be modeled as a linear function of
another random variable x with the equation y = ax + b,
where a and b (the regression coefficients) specify the slope and y-intercept of
the line, respectively.
In multiple linear regression, y will be modeled as a linear function of two or
more predictor (independent) variables.
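A brief sketch of fitting y = ax + b by least squares with NumPy (the sample points are invented for illustration):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

a, b = np.polyfit(x, y, 1)       # slope a and intercept b of the best-fit line
print(round(a, 2), round(b, 2))  # roughly 1.99 and 0.09

# only a and b need to be stored; predictions reconstruct the data
y_est = a * x + b
```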
Log-Linear Model
Log-linear model can be used to estimate the probability of each data point in
a multidimensional space for a set of discretized attributes, based on a smaller
subset of dimensional combinations.
This allows a higher-dimensional data space to be constructed from lower-
dimensional attributes.
Regression and log-linear model can both be used on sparse data (most of
the elements are zero), although their application may be limited.
Non-Parametric Methods
Histograms
Histogram is the data representation in terms of frequency.
It uses binning to approximate data distribution and is a popular form of data reduction.
Clustering
Clustering divides the data into groups/clusters; it partitions the whole dataset into different
clusters.
In data reduction, the cluster representation of the data is used to replace the actual
data. Clustering also helps to detect outliers in the data.
Sampling
Sampling can be used for data reduction because it allows a large data set to be
represented by a much smaller random data sample (or subset).
Data Cube Aggregation
Data cube aggregation involves moving the data from a detailed level to a smaller number
of dimensions.
The resulting data set is smaller in volume, without loss of information
necessary for the analysis task.
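A small sketch of two of the methods above, sampling and histograms, on synthetic data:

```python
import random
from collections import Counter

random.seed(0)
data = [random.gauss(50, 15) for _ in range(100_000)]  # a large synthetic data set

sample = random.sample(data, k=1_000)                  # represent the data by a small random subset

# histogram: approximate the distribution with equal-width bins of width 10
hist = Counter(int(x // 10) * 10 for x in sample)
print(sorted(hist.items()))                            # (bin start, frequency) pairs
```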
3. Data Compression
Data Compression is a reduction in the number of bits needed to represent data.
Compressing data can save storage capacity, speed up file transfer, and decrease
costs for storage hardware and network bandwidth.
Compressing data can be a lossless or lossy process.
Lossless compression
It enables the restoration of a file to its original state, without the loss of a single bit of data, when the file
is uncompressed.
Lossless compression is the typical approach with executables, as well as text and spreadsheet files,
where the loss of words or numbers would change the information.
Lossy compression
It permanently eliminates bits of data that are redundant, unimportant or imperceptible.
Lossy compression is useful with graphics, audio, video and images, where the removal of some data
bits has little or no discernible effect on the representation of the content.
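A quick lossless-compression check with Python's zlib:

```python
import zlib

text = b"data mining " * 100            # highly redundant data
packed = zlib.compress(text)

print(len(text), "->", len(packed))     # e.g. 1200 -> roughly 30 bytes
assert zlib.decompress(packed) == text  # lossless: the original is restored bit for bit
```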