
Data

Preparation
COS10022
Data Science Practices
Teaching Materials
Co-developed by:
Pei-Wei Tsai ([email protected])
WanTze Vong ([email protected])
Yakub Sebastian ([email protected])
Data Analytics Lifecycle

Phase 2: Data Preparation


Given the presence of an analytics sandbox, the data science team works with the data and performs analytics for the duration of the project. The team performs ETLT to get the data into the sandbox and to familiarize itself with the data thoroughly.
Outline

• OVERVIEW
• Data Quality
• Major Tasks in Data Preparation

• Data Cleaning

• Data Integration

• Data Reduction

• Data Transformation

Why is Data Preparation Important?
• Data have quality if they satisfy the requirements of the intended use.

• Factors comprising data quality:

• Accuracy: the degree to which the data represents the reality.
• Completeness: the degree to which necessary data is available for use.
• Consistency: the degree to which the data is equal within and between datasets.
• Timeliness: the degree to which the data is available at the time it is needed.
• Believability: the degree to which the data is trusted by users.
• Interpretability: the degree to which the data can be easily understood.

• Data Preparation is sometimes called Data Wrangling or Data Munging.


Major Tasks in Data Preparation
Data Cleaning: To fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.

Data Integration: To merge data from multiple data stores to help reduce redundancies and inconsistencies in the resulting dataset.

Data Reduction: To obtain a reduced representation of the dataset that is much smaller in volume, yet produces the same (or almost the same) analytical results.

Data Transformation: To modify the source data into different formats in terms of data types and values so that it is useful for mining and to make the output easier to understand.
Outline
• Overview
• Data Quality
• Major Tasks in Data Preprocessing

• DATA CLEANING
• Data Integration

• Data Reduction

• Data Transformation

Data Cleaning
Real-world data is DIRTY.

1. Incomplete Data:
• Missing attribute values, lacking certain attributes of interest, or containing only aggregate data
• E.g. Occupation = “” (missing values)

2. Noisy Data:
• Containing errors or outliers
• E.g. Salary = “-100” (a mistake or a millionaire?)

3. Inconsistent Data:
• Containing discrepancies in codes or names
• E.g. Discrepancy between duplicate records
• E.g. Was rating “1, 2, 3”, now rating “A, B, C”
• E.g. Age = “36”, Birthday = “31/08/1984” (inconsistent duplicate entries)
Incomplete Data
• Incomplete data can occur for a number of reasons:
• Attributes of interest may not always be available.
• Relevant data may not be recorded:
• Because they were not considered important at the time of entry
• Due to misunderstanding or equipment malfunctions.
• Data that were inconsistent with other recorded data may have been
deleted.
• The recording of the data history or modifications may have been
overlooked.

• Missing data, particularly for tuples with missing values for some
attributes, may need to be inferred.
Incomplete Data
• How to Handle Missing Data?
1. Ignore the tuple
• This is usually done when class label is missing (assuming the mining task involves classification).
• This method is not very effective, unless the tuple contains several attributes with missing values.

2. Fill in the missing values manually


• This method is time consuming and may not be feasible given a large dataset with many missing values.

3. Use a global constant to fill in the missing values


• Replace all missing attribute values using the same constant (such as “Unknown”, “N/A”).
• A mining program may mistakenly think that they form an interesting concept since they all have a value in
common.
Incomplete Data
• How to Handle Missing Data?
4. Use a measure of central tendency for the
attribute (e.g. the mean or median) to fill in
the missing values
• For normal data distributions, the mean can be used,
while skewed data distribution should employ the
median.

5. Use the attribute mean or median for all samples belonging to the same class as the given tuple
• E.g. If classifying customers according to credit risk, replace the missing value with the average income value for customers in the same credit risk category as that of the given tuple.
Incomplete Data
• How to Handle Missing Data?
6. Use the most probable value to fill in the
missing value
• This may be determined with regression, inference-based tools using Bayesian formalism, or decision tree induction.
• E.g. Using the other passenger attributes in the Titanic dataset, you may construct a decision tree to predict the missing values for sibsp.
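To make these strategies concrete, here is a minimal sketch of methods 1, 3, 4 and 5 above, assuming pandas is available. The column names and values are illustrative, not taken from any dataset in the slides.

```python
import pandas as pd

# Illustrative toy data with missing values in two attributes.
df = pd.DataFrame({
    "occupation":  ["engineer", None, "nurse", None],
    "income":      [52000.0, 61000.0, None, 48000.0],
    "credit_risk": ["low", "low", "high", "high"],
})

# Method 1: ignore tuples that contain missing values.
dropped = df.dropna()

# Method 3: fill with a global constant such as "Unknown".
filled_const = df.assign(occupation=df["occupation"].fillna("Unknown"))

# Method 4: fill with a measure of central tendency (median for skewed data).
filled_median = df.assign(income=df["income"].fillna(df["income"].median()))

# Method 5: fill with the mean of samples in the same class (credit_risk group).
filled_group = df.assign(
    income=df.groupby("credit_risk")["income"].transform(lambda s: s.fillna(s.mean()))
)
```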
Noisy Data
• Noise is a random error or variance in a
measured variable.
• Incorrect attribute values may be due to:
• faulty data collection instruments
• data entry problems
• data transmission problems
• technology limitation
• inconsistency in naming convention

• Other data problems which require data cleaning:
• duplicate records
• incomplete data
• inconsistent data

Example (attribute noise vs. class noise):

Att1   Att2    Class
0.25   Red     Positive
0.25   Red     Negative
0.99   Green   Negative
1.02   Green   Positive
2.05   ?       Negative
?      Green   Positive

• Class noise: contradictory examples, mislabeled examples
• Attribute noise: erroneous values, missing values
Noisy Data
• How to Handle Noisy Data?
1. Binning
• This method smooths a sorted data value by consulting its “neighborhood”, that is, the values around it.
• The sorted values are distributed into a number of “buckets”, or “bins”.
• The data for price are first sorted and then partitioned into equal-frequency bins of size 3 (i.e. each bin contains three values).
• Smoothing by bin means: each original value in a bin is replaced by the mean value of the bin (i.e. the value 9).
• Smoothing by bin boundaries: the min. and max. values in a given bin are identified, and each bin value is then replaced by the closest boundary value.
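The short sketch below reproduces this binning example in plain Python, assuming the classic sorted price list whose first bin averages to 9, as the slide implies; the exact values are otherwise illustrative.

```python
# Sorted prices partitioned into equal-frequency bins of size 3.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bin_size = 3
bins = [prices[i:i + bin_size] for i in range(0, len(prices), bin_size)]

# Smoothing by bin means: every value becomes the mean of its bin.
by_means = [[sum(b) / len(b)] * len(b) for b in bins]

# Smoothing by bin boundaries: every value snaps to the closer of min/max.
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(by_means)   # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(by_bounds)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```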
Noisy Data
• How to Handle Noisy Data?
2. Regression
• A technique that conforms data values to a function.
• E.g. Linear regression involves finding the “best” line to fit two attributes so that one
attribute can be used to predict the other.
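As a rough illustration (the x/y values below are made up), a least-squares line can be fitted with NumPy and used to replace noisy values of one attribute with values predicted from the other:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])      # roughly y = 2x plus noise

slope, intercept = np.polyfit(x, y, deg=1)   # least-squares line fit
y_smoothed = slope * x + intercept           # values conformed to the fitted line
```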
Noisy Data
• How to Handle Noisy Data?
3. Outlier analysis
• Outliers may be detected by clustering, for example, where similar values are organized
into groups, or “clusters”.

Each cluster centroid is marked with a “+”, representing the


average point in space for that cluster.

Outliers may be detected as values that fall outside of the


sets of clusters.

Fig. A 2-D plot of customer data with


respect to customer locations in a city,
showing three data clusters.
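One possible sketch of this idea, assuming scikit-learn is available; the points and the distance threshold are illustrative choices, not a prescribed method from the slides.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two tight 2-D clusters plus two far-away points.
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal([0, 0], 0.5, size=(50, 2)),
    rng.normal([5, 5], 0.5, size=(50, 2)),
    [[10.0, -3.0], [-4.0, 9.0]],
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
# Distance from each point to its own cluster centroid.
dist = np.linalg.norm(points - kmeans.cluster_centers_[kmeans.labels_], axis=1)
outliers = points[dist > dist.mean() + 3 * dist.std()]   # candidate outliers
```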
Data Discrepancies
• The first step in data cleaning as a process is discrepancy detection.

• Discrepancies can be caused by:


• Poorly designed data entry forms
• Human errors in data entry
• Deliberate errors
• e.g. respondents not wanting to divulge information about themselves
• Data decay
• e.g. outdated addresses
• Errors in instrumentation devices that record data
• System errors
• Inconsistencies due to data integration
• e.g. where a given attribute can have different names in different databases
Data Discrepancies
• How to Detect Data Discrepancies?
1. Metadata
• Use any knowledge that you may already have regarding properties of the data
• E.g. What are acceptable values for each attribute? Do all values fall within the expected
range? What are data type and domain of each attribute?

2. Check uniqueness rule, consecutive rule and null rule


• Unique rule: Each value of the given attribute must be different from all other values for
that attribute
• Consecutive rule: There can be no missing values between the lowest and highest values
for the attribute, and that all values must also be unique.
• Null rule: Specifies the use of blanks, question marks, special characters, or other strings
that may indicate the null condition.
Data Discrepancies
• How to Detect Data Discrepancies?
3. Use commercial tools
• Data scrubbing: use simple domain knowledge (e.g. postal code, spell-check) to detect
errors and make corrections
• Data auditing: by analyzing data to discover rules and relationship to detect violators
(e.g. correlation and clustering to find outliers)
Outline
• Overview
• Data Quality
• Major Tasks in Data Preprocessing

• Data Cleaning

• DATA INTEGRATION
• Data Reduction

• Data Transformation

Data Integration

• Data integration combines data from multiple sources (multiple databases, data cubes, or flat files)
into a coherent store, as in data warehousing.

• How can equivalent real-world entities from multiple data sources be matched up? This is referred
to as the Entity Identification Problem.
• E.g.: Bill Clinton = William Clinton
• E.g.: customer_id in one database = cust_number in another database

• Data integration can help detect and resolve data value conflicts.
• For the same real world entity, attribute values from different sources are different.
• Possible reasons: different representations, different scales (E.g. Metric vs. British units)

Data Integration
• Redundant data often occurs when integrating multiple databases.
• Object identification: The same attribute or object may have different names in different
databases
• Derivable data: One attribute may be a “derived” attribute in another table (E.g. annual
revenue)

• Redundant attributes may be able to be detected by correlation analysis.


• The analysis measures how strongly one attribute implies the other, based on the available data.
• For categorical data, Χ2 (Chi-Square) test is used.
• For numerical data, Correlation Coefficient and Covariance are used.

Data Integration
• How to Detect Redundant Attributes?
1. Correlation Coefficient (r) for Numerical Data
• Also called Pearson’s Product Moment Coefficient.

\[ r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n}(a_i b_i) - n\,\bar{A}\,\bar{B}}{(n-1)\,\sigma_A \sigma_B} \]

where n is the number of tuples, Ā and B̄ are the respective means of A and B, σA and σB are the respective standard deviations of A and B, and Σ(aᵢbᵢ) is the sum of the AB cross-product.
• rA,B > 0 : Positively correlated
• rA,B = 0 : Independent
• rA,B < 0 : Negatively correlated
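A quick sketch of this computation with NumPy; the two attribute arrays are illustrative.

```python
import numpy as np

a = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
b = np.array([1.5, 3.1, 6.4, 7.9, 10.2])

n = len(a)
# Sample form of the formula above, using (n-1) and sample standard deviations.
r = ((a - a.mean()) * (b - b.mean())).sum() / ((n - 1) * a.std(ddof=1) * b.std(ddof=1))

r_builtin = np.corrcoef(a, b)[0, 1]        # NumPy's Pearson correlation, same value
print(round(r, 4), round(r_builtin, 4))    # both ≈ 0.99 -> strongly positively correlated
```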
Data Integration
• Visually evaluating correlation using scatter
plots
• Scatter plots showing the correlation
coefficient from -1 to 1.
• r = 1.0 : A perfect positive relationship
• r = 0.8 : A fairly strong positive relationship
• r = 0.6 : A moderate positive relationship
• r = 0.0 : No relationship
• r = -1.0 : A perfect negative relationship
Data Integration
• How to Detect Redundant Attributes?
2. Covariance (Cov) for Numerical Data
• Consider two numeric attributes A and B, and a set of n observations {(a1, b1), …, (an, bn)}.
• The mean values of A and B are also known as the expected values of A and B, that is:

\[ E(A) = \bar{A} = \frac{\sum_{i=1}^{n} a_i}{n}, \qquad E(B) = \bar{B} = \frac{\sum_{i=1}^{n} b_i}{n} \]

• The covariance between A and B is defined as:

\[ \mathrm{Cov}(A,B) = E\big((A - \bar{A})(B - \bar{B})\big) = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n} \]

• It can be simplified in computation as:

\[ \mathrm{Cov}(A,B) = E(A \cdot B) - \bar{A}\,\bar{B} \]
Data Integration
• Visually evaluating covariance between two variables
using scatter plot.

• Cov(A, B) < 0 : A and B tend to move in opposite direction

• Cov(A, B) > 0 : A and B tend to move in the same direction

• Cov(A, B) = 0 : A and B are independent.


• Note that zero covariance does not necessarily mean that the variables are independent. A non-linear relationship can exist that still results in a covariance value of zero.
Data Integration
EXAMPLE
The table below presents a simplified example of stock prices observed at five time points for AllElectronics and HighTech, a high-tech company. If the stocks are affected by the same industry trends, will their prices rise or fall together?

Since Cov(AllElectronics, HighTech) > 0, the stock prices for both companies tend to rise together.
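A brief NumPy sketch of the covariance computation; the two price series are illustrative stand-ins, since the slide's own table is not reproduced in this text.

```python
import numpy as np

all_electronics = np.array([6.0, 5.0, 4.0, 3.0, 2.0])
high_tech       = np.array([20.0, 10.0, 14.0, 5.0, 5.0])

# Simplified form from above: Cov(A, B) = E(A·B) − E(A)·E(B)
cov = (all_electronics * high_tech).mean() - all_electronics.mean() * high_tech.mean()
print(cov)   # 7.0 > 0, so these two illustrative series tend to move together
```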
Data Integration
• How to Detect Redundant Attributes?
3. Chi-Squared (Χ2) test for Categorical Data
• The larger the Χ2 value, the more likely the variables are related.
• The cells that contribute the most to the Χ2 value are those whose actual count is very different from the expected count.

\[ \chi^2 = \sum_{i=1}^{c} \sum_{j=1}^{r} \frac{(o_{ij} - e_{ij})^2}{e_{ij}} \]

Where:
• Attribute A has c distinct values;
• Attribute B has r distinct values;
• e_ij is the expected frequency;
• o_ij is the observed frequency.
Data Integration
EXAMPLE
Suppose that a group of 1500 people was surveyed. The gender of each person was
noted. Each person was polled as to whether his or her preferred type of reading
material was fiction or nonfiction. Thus, we have two attributes, gender and
preferred reading. The observed frequency (or count) of each possible joint event is
summarized in the contingency table
Data Integration
EXAMPLE (Cont.)
• Hypothesis: Gender and preferred reading are independent.
• The degrees of freedom: (r-1)(c-1) = (2-1)(2-1) = 1.
• For 1 degree of freedom, the Χ2 value needed to reject the hypothesis at the 0.001 significance level is 10.83.
• Result: 507.83 > 10.83, so the hypothesis is rejected.
• Conclusion: Gender and preferred reading are strongly correlated.
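A hedged sketch of this test with SciPy; the contingency counts below are illustrative (the slide's own table is not included in this text), laid out as rows = preferred reading and columns = gender.

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([
    [250, 200],     # fiction:     male, female
    [50, 1000],     # non-fiction: male, female
])

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(dof)               # (r-1)(c-1) = 1 degree of freedom
print(round(chi2, 2))    # far above the 0.001 critical value of 10.83
print(p_value < 0.001)   # True -> reject the independence hypothesis
```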
References
• EMC Education Services. (2015). Data Science and Big Data Analytics:
Discovering, Analyzing, Visualizing and Presenting Data. Wiley.

• Han, J., Pei, J., & Kamber, M. (2011). Data mining: concepts and
techniques. Elsevier.
Week 8 Lecture 8 (Part 2)

Data
Preparation
COS10022
Introduction to Data Science
Outline
• Overview
• Data Quality
• Major Tasks in Data Preprocessing

• Data Cleaning

• Data Integration

• DATA REDUCTION
• Data Transformation

Data Reduction
• Why data reduction?
• A database/data warehouse may store terabytes of data
• Complex data analysis/mining may take a very long time to run on
the complete data set
• What is data reduction?
• Obtain a reduced representation of the data set that is much
smaller in volume but yet produce the same (or almost the same)
analytical results
Data Reduction Strategies
1. Data cube aggregation
• Aggregation operations are applied to the data in the construction of a data cube.
2. Attribute subset selection
• Irrelevant, weakly relevant, or redundant attributes or dimensions may be
detected and removed.

3. Dimensionality reduction
• Encoding mechanisms are used to reduce the data set size.
4. Numerosity reduction
• The data are replaced or estimated by alternative, smaller data representations
5. Discretization and concept hierarchy generation
• Raw data values for attributes are replaced by ranges or higher conceptual levels.
Data Cube Aggregation

• These data consist of the AllElectronics sales per quarter, for the years 2002 to 2004.
• The data can be aggregated so that the resulting data summarize the total sales per year instead of per quarter.
• The resulting dataset is smaller in volume, without loss of information necessary for the analysis task.

• Data cubes store multidimensional analysis of sales data, with respect to annual sales per item type for each AllElectronics branch.
• Each cell holds an aggregate data value, corresponding to the data point in multidimensional space.
• Data cubes provide fast access to precomputed, summarized data, thereby benefiting on-line analytical processing as well as data mining.
Attribute Subset Selection
• Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes.

• Heuristic methods that explore a reduced search space are commonly used to find a ‘good’ subset
of the original attributes.
• Stepwise forward selection

• Stepwise backward elimination

• Combination of forward selection and backward elimination

• Decision tree induction

• The “best” (and “worst”) attributes are typically determined using tests of statistical significance,
which assume that the attributes are independent of one another.

• Other attribute evaluation measures, such as information gain, are used in building decision trees for classification.
Attribute Subset Selection
Stepwise forward selection
1. Start with an empty set of attributes.
2. Determine the best of the original attributes and add it to the reduced set.
3. At each step, add the best of the remaining original attributes to the reduced set.

Stepwise backward elimination
1. Start with the full set of attributes.
2. At each step, remove the worst attribute remaining in the set.

Forward selection + Backward elimination
1. Start with an empty set of attributes.
2. At each step, add the best attribute to the reduced set and remove the worst from among the remaining attributes.
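As one possible concrete version of these heuristics, scikit-learn's greedy SequentialFeatureSelector can perform forward or backward stepwise search; the bundled dataset below is used purely for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

selector = SequentialFeatureSelector(
    KNeighborsClassifier(n_neighbors=3),
    n_features_to_select=2,    # size of the reduced attribute set
    direction="forward",       # start empty, add the best attribute at each step
)
selector.fit(X, y)
print(selector.get_support())  # boolean mask over the original attributes
```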
Attribute Subset Selection
Decision Tree Induction

Decision tree induction constructs a flowchart-like structure


where each internal (nonleaf) node denotes a test on an
attribute, each branch corresponds to an outcome of the test,
and each external (leaf) node denotes a class prediction. At
each node, the algorithm chooses the “best” attribute to
partition the data into individual classes.

When decision tree induction is used for attribute subset


selection, a tree is constructed from the given data. All
attributes that do not appear in the tree are assumed to be
irrelevant. The set of attributes appearing in the tree form the
reduced subset of attributes.
Dimensional Reduction
• In dimensionality reduction, data encoding or transformations are applied so as to obtain a reduced
or “compressed” representation of the original data.

• If the original data can be reconstructed from the compressed data without any loss of information, the data reduction is called lossless.
• If only an approximation of the original data can be reconstructed from the compressed data, the data reduction is called lossy.

• An example of a dimensionality reduction method: Principal Component Analysis (PCA)
• PCA main ideas in 5 minutes: https://www.youtube.com/watch?v=HMOI_lkzW08
Dimensional Reduction
• Principal Component Analysis (PCA) reduces the dimensionality (the number of features) of a
dataset by maintaining as much variance as possible.

• Example:
Gene Expression

• The original expression by 3 genes is projected onto two new dimensions. Such a two-dimensional visualization of the samples allows us to draw qualitative conclusions about the separability of experimental conditions (marked by different colors).
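A minimal PCA sketch with scikit-learn; a bundled 4-feature dataset stands in for the gene-expression example.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scale

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)        # 4 features projected onto 2 components
print(pca.explained_variance_ratio_)           # share of variance kept per component
```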
Numerosity Reduction
• Numerosity reduction techniques replace the original data volume by choosing
alternative, smaller forms of data representation.
• Parametric methods
• These methods assume that the data fits some models.
• Models such as regression and log-linear model are used to estimate the data, so that only the
data parameters need to be stored, instead of the actual data.

• Non-parametric methods
• These methods do not assume models.
• Methods such as histogram, clustering, sampling and data cube aggregation are used to store
reduced representations of data
Numerosity Reduction: Parametric Method
Linear Regression: The data are modelled to fit a straight line. The least-squares method is used to fit the line.
Y = b0 + b1 X1

Multiple Linear Regression (MLR): Allows a response variable Y to be modelled as a linear function of two or more predictor variables.
Y = b0 + b1 X1 + b2 X2

Log-Linear Model: The model takes the form of a function whose logarithm is a linear combination of the parameters of the model.
ln Y = b0 + b1 X1 + ε
Numerosity Reduction: Non-Parametric Methods

Binning: A top-down unsupervised splitting technique based on a specified number of bins.

Histogram: An unsupervised method to partition the values of an attribute into disjoint ranges called buckets or bins.
Numerosity Reduction: Non-Parametric Methods
Clustering

• A clustering algorithm can be applied to discretize a numerical attribute by partitioning the values of that attribute into clusters or groups.
• Unsupervised (top-down split or bottom-up merge)
• Partition the dataset into clusters based on similarity
• Effective if data is clustered, but not if data is “smeared”
• Cluster analysis using k-means (Lecture 6)
Numerosity Reduction: Non-Parametric Methods
Sampling
• Sampling allows a large dataset to be represented by a much smaller random data sample (or subset).
• Suppose that a large dataset, D, contains N tuples. Sampling methods:
1. Simple random sample without replacement (SRSWOR) of size s: all tuples are equally likely to be sampled.
2. Simple random sample with replacement (SRSWR) of size s: after a tuple is drawn, it is placed back in D so that it may be drawn again.
3. Cluster sample.
4. Stratified sample.
Numerosity Reduction: Non-Parametric Method
• Cluster sample: If the tuples in D are grouped into M mutually disjoint “clusters”, then a simple random sample of s clusters can be obtained, where s < M.
• Stratified sample: If D is divided into mutually disjoint parts called strata, a stratified sample of D is generated by obtaining an SRS at each stratum. This method is helpful when the data are skewed.
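The following pandas sketch shows all four schemes on an illustrative DataFrame; the "stratum" column doubles as the cluster/stratum label purely for demonstration.

```python
import pandas as pd

D = pd.DataFrame({"value": range(1000), "stratum": ["A", "B", "C", "D"] * 250})

srswor = D.sample(n=100, replace=False, random_state=0)   # SRSWOR of size s = 100
srswr  = D.sample(n=100, replace=True,  random_state=0)   # SRSWR of size s = 100

# Cluster sample: take a simple random sample of whole groups, then keep their tuples.
chosen = pd.Series(D["stratum"].unique()).sample(n=2, random_state=0)
cluster_sample = D[D["stratum"].isin(chosen)]

# Stratified sample: draw an SRS within each stratum (10% per group here).
stratified = D.groupby("stratum", group_keys=False).sample(frac=0.1, random_state=0)
```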
Outline
• Overview
• Data Quality
• Major Tasks in Data Preprocessing

• Data Cleaning

• Data Integration

• Data Reduction

• DATA TRANSFORMATION

Data Transformation
• Data transformation strategies:
1. Smoothing: Remove noise from data using techniques such as binning, regression and
clustering.
2. Attribute/feature construction: construct new attributes from the given set of attributes.
3. Aggregation: Construct data cubes
4. Normalization: Scale the attribute data to fall within a smaller, specified range such as -1.0 to 1.0, or 0.0 to 1.0.
5. Discretization: Replace raw values of a numeric attribute (e.g. age) with interval labels (e.g. 0-10, 11-20) or conceptual labels (e.g. youth, adult, senior).
6. Concept hierarchy generation for nominal data: Generalize attributes such as street to higher-
level concepts such as city or country.
Normalization
• Why normalization?
• Normalizing the data attempts to give all attributes an equal weight.

• Particularly useful for classification algorithms:


• When using neural network backpropagation algorithm for classification mining,
normalizing the input values for each attribute will speed up the learning phase.

• When using distance-based method for clustering, normalization helps prevent attributes
with initially large range (e.g. income) from outweighing attributes with initially smaller
ranges (e.g. binary attributes).
• Examples:
• Income has range $3,000-$20,000
• Age has range 10-80
• Gender has domain Male/Female
Normalization
Min-Max normalization
• Transforms the data into a desired range, usually [0, 1].

\[ v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A \]

• Where [minA, maxA] is the initial range and [new_minA, new_maxA] is the new range.
• Example: Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to:

\[ \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0) + 0 = 0.716 \]

Z-score normalization
• Useful when the actual min and max of an attribute are unknown.

\[ v' = \frac{v - \mu_A}{\sigma_A} \]

• Where μA and σA are the mean and standard deviation of the initial data values.
• Example: Let μ = $54,000 and σ = $16,000. Then $73,600 is transformed to:

\[ \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225 \]

Decimal scaling
• Transforms data into a range between [-1, 1].

\[ v' = \frac{v}{10^{\,j}} \]

• Where j is the smallest integer such that max(|v'|) < 1.
• Example: Suppose that the values of A range from -986 to 917. Divide each value by 1,000 (i.e. j = 3): -986 normalizes to -0.986 and 917 normalizes to 0.917.
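A compact sketch of the three formulas, reproducing the worked examples above:

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    return (v - mean_a) / std_a

def decimal_scaling(v, j):
    return v / 10 ** j

print(round(min_max(73_600, 12_000, 98_000), 3))          # 0.716
print(round(z_score(73_600, 54_000, 16_000), 3))          # 1.225
print(decimal_scaling(-986, 3), decimal_scaling(917, 3))  # -0.986 0.917
```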
Discretization
• Data discretization transforms numeric data by mapping values to interval or concept labels.

• Discretization techniques:
• Binning, Histogram analysis, Cluster analysis, Decision tree analysis, Correlation analysis

• For nominal data:
• Concept hierarchy

Attribute types:
• Categorical attributes:
• Nominal: categories are mutually exclusive and unordered. E.g. Sex (male/female), Blood Group (A/B/AB/O).
• Ordinal: categories are mutually exclusive and ordered. E.g. Disease Stage (mild/moderate/severe).
• Numerical attributes:
• Continuous: takes any value in a range of values. E.g. Weight in kg, Height in cm.
• Discrete: integer values, typically counts. E.g. Days sick per year.
Discretization

Binning: A top-down unsupervised splitting technique based on a specified number of bins.

Histogram: An unsupervised method to partition the values of an attribute into disjoint ranges called buckets or bins.
Discretization
Cluster Analysis

• A clustering algorithm can be applied to discretize a numerical attribute by partitioning the values of that attribute into clusters or groups.
• Unsupervised (top-down split or bottom-up merge)
• Partition the dataset into clusters based on similarity
• Effective if data is clustered, but not if data is “smeared”
• Cluster analysis using k-means (Lecture 6)
Discretization
• Decision tree analysis
• Use a top-down splitting approach
• Supervised: Make use of the class
label (e.g. cancerous vs. benign)
• Using entropy to determine split
point (discretization point: the
resulting partition contains as many
tuples of the same class as possible)
Discretization
Correlation analysis
• Use a bottom-up merge approach
• Supervised: Make use of the class label
(e.g. spam vs. genuine)
• ChiMerge: Find the best neighboring
intervals (those having similar
distributions of classes, i.e., low χ2
values) to merge.
Discretization (Concept hierarchy generation for categorical data)
• Nominal attributes have a finite (but possibly
large) number of distinct values, with no ordering
among the values.
• E.g. geographic_location, job_category, and item_type

• Concept hierarchies can be used to transform


data into multiple levels of granularity.

• Concept hierarchy formation:


• Recursively reduce the data by collecting and replacing
low level concepts (such as numeric values for age) by
higher level concepts (such as youth, adult, or senior)
Discretization (Concept hierarchy generation for categorical data)
Four methods for the generation of concept hierarchies:
1. Specification of a partial ordering of attributes explicitly at the
schema level by users or experts
• E.g. street < city < state < country

2. Specification of a portion of a hierarchy by explicit data grouping


• E.g. {Urbana, Champaign, Chicago} < Illinois

3. Specification of a set of attributes
• The system automatically generates a partial ordering by analysis of the number of distinct values; the attribute with the most distinct values is placed at the lowest level of the hierarchy.
• E.g. street < town < county < country

4. Specification of only a partial set of attributes
• E.g. only street < town, not others
Summary
• Data quality is defined in terms of accuracy, completeness, consistency,
timeliness, believability, and interpretability. These qualities are assessed based
on the intended use of the data.
• Data cleaning routines attempt to fill in missing values, smooth out noise while
identifying outliers, and correct inconsistencies in the data. Data cleaning is
usually performed as an iterative two-step process consisting of discrepancy
detection and data transformation.
• Data integration combines data from multiple sources to form a coherent data
store. The resolution of semantic heterogeneity, metadata, correlation analysis,
tuple duplication detection, and data conflict detection contribute to smooth
data integration.
Summary
• Data reduction techniques obtain a reduced representation of the data while minimizing
the loss of information content. These include methods of dimensionality reduction,
numerosity reduction, and data compression.
• Data transformation routines convert the data into appropriate forms for mining. For
example, in normalization, attribute data are scaled so as to fall within a small range such
as 0.0 to 1.0. Other examples are data discretization and concept hierarchy generation.
• Data discretization transforms numeric data by mapping values to interval or concept
labels. Such methods can be used to automatically generate concept hierarchies for the
data, which allows for mining at multiple levels of granularity. Discretization techniques
include binning, histogram analysis, cluster analysis, decision tree analysis, and
correlation analysis. For nominal data, concept hierarchies may be generated based on
schema definitions as well as the number of distinct values per attribute.
References
• EMC Education Services. (2015). Data Science and Big Data Analytics:
Discovering, Analyzing, Visualizing and Presenting Data. Wiley.

• Han, J., Pei, J., & Kamber, M. (2011). Data mining: concepts and
techniques. Elsevier.
