COS10022 - Lecture 03 - Data Preparation
Data Preparation
COS10022
Data Science Practices
Teaching Materials
Co-developed by:
Pei-Wei Tsai ([email protected])
WanTze Vong ([email protected])
Yakub Sebastian ([email protected])
Outline
• OVERVIEW
• Data Quality
• Major Tasks in Data Preparation
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation
Why is Data Preparation Important?
• Data have quality if they satisfy the requirements of the intended use.
• Data Transformation: To modify the source data into different formats in terms of data types and values so that it is useful for mining and to make the output easier to understand.
• Data Reduction: To obtain a reduced representation of the dataset that is much smaller in volume, yet produces the same (or almost the same) analytical results.
Outline
• Overview
• Data Quality
• Major Tasks in Data Preprocessing
• DATA CLEANING
• Data Integration
• Data Reduction
• Data Transformation
Data Cleaning
Real-world data is DIRTY.
1. Incomplete Data:
• Missing attribute values, lacking certain attributes of
interest, or containing only aggregate data
• E.g. Occupation = “”
2. Noisy Data:
• Containing errors or outliers
• E.g. Salary = “-100” (a mistake or a millionaire?)
• Missing data, particularly for tuples with missing values for some
attributes, may need to be inferred.
Incomplete Data
• How to Handle Missing Data?
1. Ignore the tuple
• This is usually done when the class label is missing (assuming the mining task involves classification).
• This method is not very effective, unless the tuple contains several attributes with missing values.
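As a quick illustration, here is a minimal sketch of "ignore the tuple" using pandas (an assumption: the lecture does not prescribe a tool), dropping rows whose class label is missing before a classification task. The data frame contents are illustrative.

```python
import pandas as pd

# Illustrative data: some tuples have missing values, including labels.
df = pd.DataFrame({
    "occupation": ["engineer", "", "teacher", None],
    "salary":     [52_000, 48_000, None, 61_000],
    "label":      ["yes", None, "no", None],        # class label
})

# Ignore (drop) tuples whose class label is missing.
cleaned = df.dropna(subset=["label"])
print(cleaned)
```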
Noisy Data
• How to Handle Noisy Data?
1. Binning
• This method smooths a sorted data value by consulting its “neighborhood”, that is, the values around it.
• The sorted values are distributed into a number of “buckets”, or “bins”.
The data for price are first sorted and then partitioned into equal-frequency bins of size 3 (i.e. each bin contains three values). The min. and max. values in a given bin are identified. Each bin value is then replaced by the closest boundary value.
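A minimal sketch of this smoothing-by-bin-boundaries procedure, assuming NumPy; the price values below are illustrative, not the lecture's actual figures.

```python
import numpy as np

prices = np.array([15, 4, 9, 21, 8, 34, 25, 28, 24])   # illustrative

depth = 3                                  # bin size (values per bin)
sorted_vals = np.sort(prices)              # 1. sort the data
bins = sorted_vals.reshape(-1, depth)      # 2. equal-frequency bins of size 3

smoothed = bins.copy()
for row in smoothed:
    lo, hi = row.min(), row.max()          # 3. identify the bin boundaries
    # 4. replace each value by whichever boundary is closer
    row[:] = np.where(np.abs(row - lo) <= np.abs(row - hi), lo, hi)

print(smoothed)   # every value now equals its bin's min or max
```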
Noisy Data
• How to Handle Noisy Data?
2. Regression
• A technique that conforms data values to a function.
• E.g. Linear regression involves finding the “best” line to fit two attributes so that one
attribute can be used to predict the other.
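A minimal sketch of smoothing by linear regression, assuming NumPy and illustrative data: the "best" least-squares line relating x to y is fitted, and noisy y values are replaced by the fitted values.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8, 12.3])    # roughly y = 2x, with noise

slope, intercept = np.polyfit(x, y, deg=1)       # least-squares line
y_smoothed = slope * x + intercept               # fitted (smoothed) values
print(slope, intercept)
print(y_smoothed)
```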
Noisy Data
• How to Handle Noisy Data?
3. Outlier analysis
• Outliers may be detected by clustering, for example, where similar values are organized
into groups, or “clusters”.
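A minimal sketch of outlier detection via clustering, assuming scikit-learn and illustrative data: points that fall far from every cluster centre are flagged as potential outliers. The percentile threshold is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),        # cluster around (0, 0)
               rng.normal(8, 1, (50, 2)),        # cluster around (8, 8)
               [[20.0, 20.0]]])                  # one obvious outlier

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
dist_to_centre = kmeans.transform(X).min(axis=1)  # distance to nearest centroid
threshold = np.percentile(dist_to_centre, 99)     # illustrative cut-off
print(np.where(dist_to_centre > threshold)[0])    # indices of suspected outliers
```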
Outline
• Overview
• Data Quality
• Major Tasks in Data Preprocessing
• Data Cleaning
• DATA INTEGRATION
• Data Reduction
• Data Transformation
Data Integration
• Data integration combines data from multiple sources (multiple databases, data cubes, or flat files)
into a coherent store, as in data warehousing.
• How can equivalent real-world entities from multiple data sources be matched up? This is referred
to as the Entity Identification Problem.
• E.g.: Bill Clinton = William Clinton
• E.g.: customer_id in one database = cust_number in another database
• Data integration can help detect and resolve data value conflicts.
• For the same real world entity, attribute values from different sources are different.
• Possible reasons: different representations, different scales (E.g. Metric vs. British units)
Data Integration
• Redundant data often occurs when integrating multiple databases.
• Object identification: The same attribute or object may have different names in different
databases
• Derivable data: One attribute may be a “derived” attribute in another table (E.g. annual
revenue)
Data Integration
• How to Detect Redundant Attributes?
1. Correlation Coefficient (r) for Numerical Data
• Also called Pearson’s Product Moment Coefficient.
$$ r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A\,\sigma_B} = \frac{\sum_{i=1}^{n} a_i b_i - n\bar{A}\bar{B}}{(n-1)\,\sigma_A\,\sigma_B} $$

where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, σA and σB are the respective standard deviations of A and B, and Σaibi is the sum of the AB cross-product.
• rA,B > 0 : Positively correlated
• rA,B = 0 : Independent
• rA,B < 0 : Negatively correlated
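A minimal sketch of the correlation coefficient, assuming NumPy; the two arrays are illustrative stand-ins for numeric attributes A and B. It evaluates the formula above directly, then checks against NumPy's built-in.

```python
import numpy as np

A = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
B = np.array([1.1, 2.2, 2.9, 4.1, 5.0])

n = len(A)
# r_{A,B} with sample standard deviations (ddof=1), matching the (n-1) formula
r = ((A - A.mean()) * (B - B.mean())).sum() / ((n - 1) * A.std(ddof=1) * B.std(ddof=1))
print(r)                          # close to +1: strongly positively correlated
print(np.corrcoef(A, B)[0, 1])    # same value via NumPy's built-in
```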
Data Integration
• Visually evaluating correlation using scatter
plots
• Scatter plots showing the correlation
coefficient from -1 to 1.
• r = 1.0 : A perfect positive relationship
• r = 0.8 : A fairly strong positive relationship
• r = 0.6 : A moderate positive relationship
• r = 0.0 : No linear relationship
• r = -1.0 : A perfect negative relationship
Data Integration
• How to Detect Redundant Attributes?
2. Covariance (Cov) for Numerical Data
• Consider two numeric attributes A and B, and a set of n observations {(a1, b1), …, (an, bn)}.
• The mean values of A and B are also known as the expected values of A and B, that is:

$$ E(A) = \bar{A} = \frac{\sum_{i=1}^{n} a_i}{n} \qquad E(B) = \bar{B} = \frac{\sum_{i=1}^{n} b_i}{n} $$

• The covariance between A and B is defined as:

$$ Cov(A,B) = E\big((A-\bar{A})(B-\bar{B})\big) = \frac{\sum_{i=1}^{n}(a_i-\bar{A})(b_i-\bar{B})}{n} $$

• E.g. if Cov(AllElectronics, HighTech) > 0, the stock prices for both companies rise together.
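A minimal sketch of the covariance computation, assuming NumPy; the two price series are illustrative, in the spirit of the AllElectronics/HighTech stock example, not the lecture's actual figures.

```python
import numpy as np

all_electronics = np.array([6.0, 5.0, 4.0, 3.0, 2.0])    # illustrative prices
high_tech       = np.array([20.0, 10.0, 14.0, 5.0, 5.0])

n = len(all_electronics)
# Population covariance, matching the formula above (divide by n)
cov = ((all_electronics - all_electronics.mean())
       * (high_tech - high_tech.mean())).sum() / n
print(cov)                                                  # > 0: rise together
print(np.cov(all_electronics, high_tech, bias=True)[0, 1])  # same via NumPy
```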
Data Integration
• How to Detect Redundant Attributes?
3. Chi-Squared (χ²) Test for Categorical Data
• The larger the χ² value, the more likely the variables are related.
• The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count.

$$ \chi^2 = \sum_{i=1}^{c} \sum_{j=1}^{r} \frac{(o_{ij} - e_{ij})^2}{e_{ij}} $$

Where:
Attribute A has c distinct values;
Attribute B has r distinct values;
oij is the observed frequency (actual count) of the joint event (Ai, Bj);
eij is the expected frequency of that joint event.
Data Integration
EXAMPLE
Suppose that a group of 1500 people was surveyed. The gender of each person was
noted. Each person was polled as to whether his or her preferred type of reading
material was fiction or nonfiction. Thus, we have two attributes, gender and
preferred reading. The observed frequency (or count) of each possible joint event is
summarized in the contingency table.
Data Integration
EXAMPLE (Cont.)
For 1 degree of freedom, the χ² value needed to reject the hypothesis of independence at the 0.001 significance level is 10.83.
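A minimal sketch of this test, assuming SciPy. The 2×2 counts below are illustrative placeholders (the lecture's contingency table is not reproduced here); rows are gender, columns are fiction vs non-fiction. `correction=False` disables Yates' continuity correction so the result matches the plain χ² formula above.

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[250,  50],      # illustrative counts
                     [200, 1000]])

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(chi2, dof, p)   # for these counts chi2 is far above 10.83 at dof = 1,
                      # so the hypothesis of independence would be rejected
print(expected)       # expected counts e_ij under independence
```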
• Han, J., Pei, J., & Kamber, M. (2011). Data Mining: Concepts and Techniques. Elsevier.
Week 8 Lecture 8 (Part 2)
Data Preparation
COS10022
Introduction to Data Science
Outline
• Overview
• Data Quality
• Major Tasks in Data Preprocessing
• Data Cleaning
• Data Integration
• DATA REDUCTION
• Data Transformation
Data Reduction
• Why data reduction?
• A database/data warehouse may store terabytes of data
• Complex data analysis/mining may take a very long time to run on
the complete data set
• What is data reduction?
• Obtain a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results
Data Reduction Strategies
1. Data cube aggregation
• Aggregation operations are applied to the data in the construction of a data cube.
2. Attribute subset selection
• Irrelevant, weakly relevant, or redundant attributes or dimensions may be
detected and removed.
3. Dimensionality reduction
• Encoding mechanisms are used to reduce the data set size.
4. Numerosity reduction
• The data are replaced or estimated by alternative, smaller data representations.
5. Discretization and concept hierarchy generation
• Raw data values for attributes are replaced by ranges or higher conceptual levels.
Attribute Subset Selection
• Heuristic methods that explore a reduced search space are commonly used to find a ‘good’ subset
of the original attributes.
• Stepwise forward selection
• The “best” (and “worst”) attributes are typically determined using tests of statistical significance,
which assume that the attributes are independent of one another.
• Other attribute evaluation measures, such as information gain, are used in building decision trees for classification.
Attribute Subset Selection
Stepwise forward selection
1. Start with an empty set of attributes
2. Determine the best of the original
attributes and add it to the reduced set.
3. At each step, add the best of the
remaining original attributes to the
reduced set.
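A minimal sketch of this greedy procedure, assuming scikit-learn (the lecture does not name a library): `SequentialFeatureSelector` starts from an empty set and, at each step, adds the attribute that most improves cross-validated performance. The dataset is a standard illustrative one.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),   # the "best attribute" criterion
    n_features_to_select=2,
    direction="forward",                 # "backward" gives stepwise elimination
)
selector.fit(X, y)
print(selector.get_support())            # boolean mask of selected attributes
```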
Dimensionality Reduction
• Example: Gene Expression
• The original expression by 3 genes is projected onto two new dimensions. Such a two-dimensional visualization of the samples allows us to draw qualitative conclusions about the separability of experimental conditions (marked by different colors).
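A minimal sketch of this kind of projection using principal component analysis, assuming scikit-learn; the expression matrix is random illustrative data, not the lecture's.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
expression = rng.normal(size=(20, 3))      # 20 samples x 3 genes (illustrative)

pca = PCA(n_components=2)
projected = pca.fit_transform(expression)  # 20 samples x 2 new dimensions
print(projected.shape)
print(pca.explained_variance_ratio_)       # variance captured by each new axis
```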
Numerosity Reduction
• Numerosity reduction techniques replace the original data volume by choosing
alternative, smaller forms of data representation.
• Parametric methods
• These methods assume that the data fit some model.
• Models such as regression and log-linear model are used to estimate the data, so that only the
data parameters need to be stored, instead of the actual data.
• Non-parametric methods
• These methods do not assume models.
• Methods such as histograms, clustering, sampling and data cube aggregation are used to store reduced representations of the data.
Numerosity Reduction: Parametric Method
• Linear Regression: The data are modelled to fit a straight line. The least-squares method is used to fit the line.
  Y = b0 + b1X1
• Multiple Linear Regression: Allows a response variable Y to be modelled as a linear function of two or more predictor variables.
  Y = b0 + b1X1 + b2X2
• Log-Linear Model: The model takes the form of a function whose logarithm is a linear combination of the parameters of the model.
  ln Y = b0 + b1X1 + ε
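A minimal sketch of multiple linear regression as a parametric reduction, assuming NumPy and illustrative data: once the line is fitted, only the coefficients (b0, b1, b2) need to be stored rather than the raw data.

```python
import numpy as np

rng = np.random.default_rng(0)
X1, X2 = rng.normal(size=100), rng.normal(size=100)
Y = 1.5 + 2.0 * X1 - 0.5 * X2 + rng.normal(0, 0.1, 100)   # illustrative data

design = np.column_stack([np.ones(100), X1, X2])      # columns [1, X1, X2]
coeffs, *_ = np.linalg.lstsq(design, Y, rcond=None)   # least-squares fit
print(coeffs)   # approximately [1.5, 2.0, -0.5]: the stored parameters
```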
Numerosity Reduction: Non-Parametric Methods
• Binning: A top-down unsupervised splitting technique based on a specified number of bins.
• Histogram: An unsupervised method to partition the values of an attribute into disjoint ranges called buckets or bins.
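A minimal sketch of a histogram as a reduced representation, assuming NumPy: only the bucket boundaries and counts are kept instead of the raw values, which here are illustrative.

```python
import numpy as np

prices = np.array([1, 1, 5, 5, 5, 8, 8, 10, 10, 10, 12, 14, 14, 15,
                   15, 15, 15, 18, 18, 20, 20, 20, 21, 25, 25, 25, 28, 30])

counts, edges = np.histogram(prices, bins=4)   # 4 equal-width buckets
print(edges)    # bucket boundaries
print(counts)   # number of values per bucket: all that needs storing
```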
Numerosity Reduction: Non-Parametric Methods
Clustering
• Cluster representations of the data (e.g. cluster centroids) are stored in place of the actual data.
Outline
• Overview
• Data Quality
• Major Tasks in Data Preprocessing
• Data Cleaning
• Data Integration
• Data Reduction
• DATA TRANSFORMATION
Data Transformation
• Data transformation strategies:
1. Smoothing: Remove noise from data using techniques such as binning, regression and
clustering.
2. Attribute/feature construction: Construct new attributes from the given set of attributes.
3. Aggregation: Apply aggregation operations to the data, for example in constructing data cubes.
4. Normalization: Scale the attribute data to fall within a smaller, specified range such as -1.0 to 1.0, or 0.0 to 1.0.
5. Discretization: Replace raw values of a numeric attribute (e.g. age) with interval labels (e.g. 0-10, 11-20) or conceptual labels (e.g. youth, adult, senior).
6. Concept hierarchy generation for nominal data: Generalize attributes such as street to higher-
level concepts such as city or country.
Normalization
• Why normalization?
• Normalizing the data attempts to give all attributes equal weight.
• When using distance-based method for clustering, normalization helps prevent attributes
with initially large range (e.g. income) from outweighing attributes with initially smaller
ranges (e.g. binary attributes).
• Examples:
• Income has range $3,000-$20,000
• Age has range 10-80
• Gender has domain Male/Female
Normalization
• Min-Max normalization: Transforms the data into a desired range, usually [0, 1].

$$ v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A $$

Where [minA, maxA] is the initial range and [new_minA, new_maxA] is the new range.
E.g. let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to:

$$ \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0) + 0 = 0.716 $$

• Z-score normalization: Useful when the actual min and max of an attribute are unknown.

$$ v' = \frac{v - \mu_A}{\sigma_A} $$

Where μA and σA are the mean and standard deviation of the initial data values.
E.g. let μ = $54,000 and σ = $16,000. Then $73,600 is transformed to:

$$ \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225 $$

• Decimal scaling: Transforms the data into a range between [-1, 1].

$$ v' = \frac{v}{10^j} $$

Where j is the smallest integer such that Max(|v'|) < 1.
E.g. suppose that the values of A range from -986 to 917. Divide each value by 1,000 (i.e. j = 3): -986 normalizes to -0.986 and 917 normalizes to 0.917.
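A minimal sketch of the three methods, assuming NumPy, applied to the slide's income example (range $12,000 to $98,000, value $73,600):

```python
import numpy as np

v = np.array([12_000.0, 54_000.0, 73_600.0, 98_000.0])

# Min-max normalization to [0, 1]
min_max = (v - v.min()) / (v.max() - v.min())

# Z-score normalization, using the slide's mean and standard deviation
mu, sigma = 54_000.0, 16_000.0
z_score = (v - mu) / sigma

# Decimal scaling: divide by 10^j so that max(|v'|) < 1
j = int(np.ceil(np.log10(np.abs(v).max())))
decimal_scaled = v / 10**j

print(min_max[2])         # 0.716... for $73,600
print(z_score[2])         # 1.225 for $73,600
print(decimal_scaled)     # all values now strictly inside (-1, 1)
```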
Discretization
• Data discretization transforms numeric data by mapping values to interval or concept labels.
• (Diagram: attributes divide into categorical and numerical types; discretization applies to numerical attributes.)
• Binning: A top-down unsupervised splitting technique based on a specified number of bins.
• Histogram: An unsupervised method to partition the values of an attribute into disjoint ranges called buckets or bins.
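A minimal sketch of discretization with pandas (an assumption: the lecture does not name a tool), mapping raw ages to interval labels and concept labels as described above; the ages and cut points are illustrative.

```python
import pandas as pd

ages = pd.Series([3, 9, 15, 21, 30, 45, 70])

intervals = pd.cut(ages, bins=[0, 10, 20, 120])          # interval labels
concepts  = pd.cut(ages, bins=[0, 20, 60, 120],
                   labels=["youth", "adult", "senior"])  # concept labels
print(pd.DataFrame({"age": ages, "interval": intervals, "concept": concepts}))
```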
Discretization
Cluster Analysis
• Cluster analysis can discretize a numeric attribute by partitioning its values into clusters, each of which becomes one interval.
• Han, J., Pei, J., & Kamber, M. (2011). Data Mining: Concepts and Techniques. Elsevier.