
Data Mining and Data Warehousing

(21IS643)

Dr. Niharika P. Kumar


Associate Professor, Dept. of ISE
RVITM, Bengaluru
[email protected]
Module 2
Introduction to Data Mining
Data Warehouse Implementation

• A data warehouse contains a humongous amount of data.


• Hence, the system needs to be implemented efficiently for faster cube
computation, quicker access methods and efficient query techniques.
Efficient Data Cube Computation : An overview
In normal SQL, we aggregate data using “group-by” clause.
Example 1: Compute the sum of sales but group the results by city and item.
Example 2: Compute the sum of sales (i.e., total sales), but group by city.
Example 3: Compute the sum of sales (i.e. total sales), but this time group by
type of items
Let us assume we have three attributes {city, item, year}.
Now the manager of the company wants to look at the sales numbers using
different “group-by” clauses using the data from the data ware house.
There are 2³ = 8 possible group-bys for this example: {(city, item, year), (city, item),
(city, year), (item, year), (city), (item), (year), ()}.
We can visualize these 8 different group-bys pictorially as shown in the Figure.
Each of these 8 group-bys is called a "cuboid" of the
"data cube".
So, what exactly does the figure represent?
The bottom (or base) cuboid groups by all three attributes, so any
combination of the three attributes can be used to query the results.
The top (or apex) cuboid gives the result of all sales combined
together, i.e., a single value for total sales.
For faster data ware house access, all these cuboids can be
precomputed and the results of computation stored in
warehouse.
So whenever the manager wants an answer, we need not
compute every time. We just fetch the precomputed results. This
reduces the computation time.
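To make the idea concrete, here is a minimal sketch (not from the textbook) that precomputes all 2³ = 8 cuboids of a tiny, made-up sales table with the dimensions {city, item, year} using pandas; the table values are purely illustrative.

from itertools import combinations
import pandas as pd

# Toy fact table: every row is one sale (values are illustrative only)
sales = pd.DataFrame({
    "city": ["Vancouver", "Vancouver", "Toronto", "Toronto"],
    "item": ["TV", "Phone", "TV", "Phone"],
    "year": [2023, 2023, 2024, 2024],
    "amount": [400, 300, 250, 500],
})

dims = ["city", "item", "year"]
cuboids = {}
for k in range(len(dims), -1, -1):            # from the base cuboid down to the apex
    for group in combinations(dims, k):
        if group:                              # e.g. (city, item), (city,), ...
            cuboids[group] = sales.groupby(list(group))["amount"].sum()
        else:                                  # apex cuboid (): one grand total
            cuboids[group] = sales["amount"].sum()

print(len(cuboids))                            # 8 cuboids for 3 dimensions
print(cuboids[("city",)])                      # total sales grouped by city

Once these 8 results are stored, a "group by city" question is answered by a simple lookup instead of a fresh aggregation.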
Disadvantage:
The main disadvantage of doing precomputation & storing the results
is the amount of space consumed to store the results.
With just three attributes i.e. {city, item, year} we got 8 cuboids.
Imagine if we had hundreds of attributes. We may end up with
thousands of cuboids each generating lot of data.
So far we have referred only to plain attributes. Some attributes have their own
hierarchy. For example, "address" is an attribute, but "address" itself
contains "street", "city", "state", "country", "continent", and all of these
levels need to be stored.
So if there are 10 dimensions and each dimension has 5 levels, then
the total number of cuboids generated would be 5¹⁰ ≈ 9.8 × 10⁶ cuboids.
So it is practically impossible to precompute all these cuboids and
store them for faster access.
So what we could do is partially compute (i.e. materialize) some of the
cuboids and store them. Let us look at different options below:
Partial Materialization: Selected Computation of Cuboids
There are three ways of data cube materialization (i.e.,
computation of cuboids)
(a) No materialization: precompute only
the base cuboid; do not precompute any of the
"non-base" cuboids.
Disadvantage :
Non-base queries have to be computed as and
when the user asks them. This is an expensive, i.e.,
time-consuming, operation.
(b) Full materialization: Precompute all of the cuboids.
Disadvantage :
Consumes a lot of space to store the precomputed cuboids
(c) Partial materialization
➢ Selectively precompute only a subset of all possible cuboids.
➢ The choice of subset may depend on user-specified criteria.
➢ This method is a tradeoff between saving storage space and
reducing response time.

There are several ways to select which cuboids to materialize:


(a) Iceberg cube: construct a data cube that stores only those cells
whose aggregate value (e.g., count) crosses some threshold.
(b) Shell cube: this is another way to materialize. In this
method we precompute only those cuboids that have a small
number of dimensions (say, three to five dimensions).
If we have to compute cuboids for higher dimensions, then
that can be done on the fly (i.e., when needed).
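As a concrete illustration of the iceberg-cube idea above, here is a small pandas sketch (the toy data and the threshold are chosen purely for illustration) that keeps only the aggregated cells whose count meets a minimum threshold.

import pandas as pd

# Toy transactions (illustrative only)
sales = pd.DataFrame({
    "city": ["Delhi", "Delhi", "Delhi", "Mumbai"],
    "item": ["TV", "TV", "Phone", "TV"],
})

min_count = 2                                  # the iceberg threshold (an assumption)
counts = sales.groupby(["city", "item"]).size()
iceberg_cells = counts[counts >= min_count]    # only the "heavy" cells are stored
print(iceberg_cells)                           # (Delhi, TV) -> 2; the sparse cells are dropped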
Indexing OLAP Data

• Bitmap Indexing
• Join Indexing
• OLAP Query Processing
OLAP Server Architecture ROLAP v/s MOLAP v/s HOLAP

• OLAP servers show multidimensional data to the user (in the form of
charts, graphs, tables, etc.).
• But how does this data get stored inside the warehouse?
• There are three ways to implement the storage of the warehouse server.
✓ Relational OLAP(ROLAP) Servers
▪ These servers sit between the relational backend server(ie say the
operational database containing ecommerce transactions) and the front
end tools (like charts etc).
▪ They are responsible for storing the warehouse data. They use
relational DBMS or extended relational DBMS to store this
warehouse data.
▪ They also store middleware data
▪ ROLAP are more scalable compared to MOLAP
➢ Each row is NOT a user transaction; each row is actually
warehouse data from one of the cuboids.
Multidimensional OLAP(MOLAP) servers

• MOLAP servers support multidimensional data(ex sales data for 3


dimensions like item, location, time/quarter) by storing them in “array
based multidimensional storage engines”.
• Whenever we create a multidimensional view it directly gets mapped to
these data cube array structures.
• Disadvantage: if the data is sparse (i.e., not dense), then the multidimensional
array storage is not used efficiently.

Hybrid OLAP(HOLAP) Servers

• These servers combine ROLAP and MOLAP technologies


• Advantage: the ROLAP part gives it scalability and the MOLAP part gives it
faster computation ability.
• Ex: large amounts of detailed data can be kept in the ROLAP store, whereas the
aggregations (i.e., summaries) can be kept in MOLAP storage.
• Microsoft SQL Server 2000 supports a hybrid OLAP server.

Note: End of Text Book 2


Many well-known applications are generating large amounts of data that need new
techniques for data analysis.
Business
• Retailers/Shopkeepers collect point-of-sale data that provides details on
customer purchases.
• Even ecommerce sites generate business critical data
• Data mining techniques can be used to support business intelligence
applications like customer profiling, targeted marketing, fraud detection,
etc.
• It helps answer questions like: who are the most profitable customers,
and which products can be cross-sold.
Medicine, Science, and Engineering
• NASA has deployed satellites to get the climate system related data.
Traditional data analysis tools can’t be used for this data.
• Genomic data is being gathered to understand the structure and function of
genes. Traditional data analysis tools can analyze only a few genes at a
time. We need data mining tools to understand and analyze large
numbers of genes.
What is Data Mining?
• Data mining is the process of automatically discovering useful
information in large data repositories.
• Data mining involves scouring large databases and trying to find
novel & useful patterns.
• This is NOT same as information retrieval like querying employee
records or fetching a web page. These are NOT Data mining
tasks.
Data mining and knowledge Discovery
• Data mining is one of the key steps in knowledge discovery in
databases (KDD).
• KDD involves a series of steps to convert input data into useful information.
These steps are:
➢ Input Data
➢ Data Preprocessing
➢ Data Mining
➢ Post Processing
Input Data
✓ Input data can reside in different formats like flat files, spread sheets,
relational tables.
✓ This data can either be in a single place or distributed in multiple sites.
Preprocessing
✓ This process involves converting the raw data into an appropriate format
so that it can be processed further.
✓ This involves cleaning the data, removing noise and duplicates, fusing
data from multiple sources, etc.
Postprocessing
✓ The insights offered by data mining are integrated with other tools (e.g.,
visualization) so that they can be explored and understood better.
✓ Hypothesis-testing methods can also be used during the post-
processing phase to remove spurious data mining results.
➢ Motivating Challenges
• What were the challenges faced by traditional data analysis
techniques that led to evolution of data mining?
➢ Scalability
▪ We now generate data in Gigabytes, Terabytes and Petabytes.
▪ Datamining algorithms should be designed to be scalable to
handle growing data.

▪ To be scalable, data mining algorithms

(a) use special search strategies to handle large search problems efficiently,
(b) may also need novel data structures to access individual records
efficiently,
(c) and can use parallel and distributed algorithms to improve
scalability.
High Dimensionality
• Data sets can have hundreds or thousands of attributes (i.e., columns in
the data table).
• Ex: Temperature measurements of locations taken repeatedly
can result in thousands of dimensions.
• In such scenarios traditional data analysis techniques do not
work well.
Heterogeneous & Complex data
• Usually traditional data analysis tools work on data that is
homogeneous (one type of data).
• But all recent fields like medicine, science, business generate
heterogeneous data:
➢ Ex: climate data containing time series
information (temperature, pressure, etc.)
➢ Ex web page data containing text, link, images etc
➢ Data mining tools need to handle such heterogeneous data.
Data Ownership & Distribution

▪ Data needed for mining need not be centralized or it need NOT be


owned by one organization.
▪ So, the challenge is to develop data mining tools that handle
distributed data.
There are some challenges in handling distributed data
(a) How to reduce communication across distributed
components
(b) How to consolidate results of data mining from different
sources
(c) How to handle data security issues

Non Traditional Data Analysis

• In traditional data analysis technique, a hypothesis is first proposed, then


engineers collect data and then this data is analyzed to prove the
hypothesis
• However, current data analysis uses techniques that generate
thousands of hypotheses and evaluate them. Also, the data is usually not from
carefully designed experiments but is an opportunistic sample; hence, we need non-
traditional analysis.
Data Mining Tasks

Data mining tasks can be divided into two major categories:


(a) Predictive Tasks
In this task we use a set of attributes (called explanatory/independent
variables) in order to predict the value of another attribute (called the target/dependent
variable).
(b) Descriptive Tasks
• Here the objective is to derive patterns (i.e., trends, clusters,
correlations) from the data so that we can summarize the relationships
in the data.
• These tasks are usually exploratory in nature and need a lot of
postprocessing in order to explain the results.
There are 4 core datamining tasks used throughout the syllabus:
(a) Predictive Modelling
(b) Association Analysis
(c) Cluster Analysis
(d) Anomaly Detection
Predictive Modelling

• This is the most famous modelling activity in AI & ML


• i.e., predict the output (target variable) based on a set of
inputs (explanatory variables).
There are two types of predictive modelling
(a) Classification
ie: classify a picture as cat or dog
(b) Regression
Ie: predict the price of a house based on certain input parameters
➢ Classification produces a discrete output (YES/NO), whereas
regression produces continuous values as output (stock price,
house price, etc.)
In both cases the aim is to learn a model that minimizes the error
between the predicted value and the true/actual value
Iris Flower Dataset

Let us try to understand a machine learning models using Iris Dataset


Iris is a type of flower that comes in 3 varieties that differ in their petal type
Example : IRIS Dataset

UCI Machine Learning Repository at
http://www.ics.uci.edu/~mlearn

➢ The IRIS dataset is a data set containing petal length & petal width
for 3 iris species: setosa, versicolor and virginica.
➢ It has two more features: sepal length & sepal width

• If petal length & width are low


then it is setosa.
• If petal length & width are medium
then it is versicolor.
• If both are large then it is Virginica
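The rules above can be written as a tiny rule-based classifier. The sketch below is only illustrative; the numeric thresholds (2.5 cm and 5 cm for petal length, 1.0 cm and 1.8 cm for petal width) are assumptions chosen to roughly match the Iris data, not values from the slides.

def classify_iris(petal_length, petal_width):
    # Low petal length & width -> setosa
    if petal_length < 2.5 and petal_width < 1.0:
        return "setosa"
    # Medium petal length & width -> versicolor
    elif petal_length < 5.0 and petal_width < 1.8:
        return "versicolor"
    # Both large -> virginica
    else:
        return "virginica"

print(classify_iris(1.4, 0.2))   # setosa
print(classify_iris(4.5, 1.5))   # versicolor
print(classify_iris(6.0, 2.3))   # virginica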
Association Analysis
• This is the second type of mining task.
• The aim of this analysis is to find patterns in the data that describe
strongly associated features (columns).
Ex 1 : Find a group of genes that have related functionality
Ex 2: Identify websites that are accessed together
Ex 3: Identify relationship between different elements of earth systems

*Note : Whoever Buys diapers also buys Milk {Diaper} -> {Milk}
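A minimal sketch of how the strength of the rule {Diaper} -> {Milk} could be measured on a handful of made-up transactions, using the standard support and confidence measures (the transactions themselves are purely illustrative).

transactions = [
    {"Bread", "Diaper", "Milk"},
    {"Diaper", "Milk", "Beer"},
    {"Bread", "Butter"},
    {"Diaper", "Milk"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"Diaper", "Milk"} <= t)   # Diaper AND Milk
diaper = sum(1 for t in transactions if "Diaper" in t)

support = both / n            # fraction of all transactions containing both items
confidence = both / diaper    # of the Diaper buyers, how many also bought Milk
print(support, confidence)    # 0.75 1.0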
Cluster Analysis

• This is the third type of data mining task that tries to find or
make groups of related observations.
• ie all related items are clustered together
Ex Document Clustering

A good clustering algorithm will find these two clusters:
Cluster 1: documents related to Economy
Cluster 2: documents related to Health
Anomaly Detection

• This is the 4th type of data mining task that tries to identify
observations that are significantly different from rest of the
data.
• Aim is to identify anomalies & avoid false detection/labelling
normal objects as anomalies.
• A good anomaly detection algorithm has high detection rate &
low false alarm rate.

Example: Credit card fraud detection


Create a history of transactions for a user. When a new
transaction is made on the credit card, check if this transaction
could be a fraud one or a genuine one from user.
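A minimal sketch of this idea, assuming a simple z-score rule: build statistics from the user's past transaction amounts and flag a new amount that lies far from that history. The amounts and the 3-standard-deviation threshold are illustrative assumptions, not part of the slides.

import statistics

history = [420, 515, 480, 530, 450, 495]        # past transaction amounts for one user
mean = statistics.mean(history)
std = statistics.stdev(history)

def is_anomalous(amount, threshold=3.0):
    # Flag the transaction if it lies more than `threshold` standard deviations from the mean
    return abs(amount - mean) / std > threshold

print(is_anomalous(505))      # False: consistent with the user's history (genuine)
print(is_anomalous(9000))     # True: very far from the history (candidate fraud)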
Types of Data
Dataset: It is a collection of data objects
Data Object: Each data object is a collection of attributes(ie each row of
dataset is called a data object)
It is also called a record, vector, point, pattern, event, case,sample

Attributes:
Each attribute captures one basic characteristic of an object(each column of
the dataset is called an attribute)
It is also called variable, characteristic, field, feature, dimension
Attributes and Measurement

What is an Attribute?
An attribute is a property or characteristic of an object that may
vary, either from one object to another or from one time to another.
Example: color of eye is an attribute of a person object. Each
person has a different eye color. So, if we have a “person” dataset
with each row having information of 1 person then eye color will be
one attribute of a person(ie one column of a dataset)

Example: Temperature is another attribute that can take unlimited
values, whereas the eye color attribute takes a limited set of values

What is measurement scale?


• It is a process of assigning values to the attributes
• Ex gender is assigned value male/female
Type of An Attribute

Employee Age & ID number


➢ EmployeeID and age are two attributes we associate with an
employee.
➢ Both can take integer values
➢ While we can compute the average age of employees, it does not
make sense to compute the average of employee IDs.
➢ We have a maximum limit on the age attribute, but we don't need any limit on
employee IDs.
➢ Hence, certain operations should be limited on certain attributes
based on type of attribute(even though both attributes are
represented using integers).

Type of measuring scale

✓ The type of an attribute should tell us what operation can


be performed on the attributes.
✓ Avoid doing operations like perform average on emp ID
The Different Types of Attributes

• Each attribute has certain underlying properties.

• Ex: the length attribute has the properties of a number.
• We can compare two lengths, order objects by length, and calculate the
difference or ratio of lengths.

The following operations can be performed on attributes:
distinctness (=, ≠), order (<, >), addition (+, −) and multiplication (×, ÷).

There are four different types of attributes:
(a) Nominal
(b) Ordinal
(c) Interval
(d) Ratio

These 4 attribute types can be divided further into:
✓ Categorical (i.e., qualitative): nominal & ordinal attributes
✓ Numeric (i.e., quantitative): interval & ratio attributes
• Qualitative attributes like employee ID, cannot be operated like
numbers, even if we use integers to represent them. These should be
treated like symbols.
• Quantitative attributes are treated like regular numbers or integers to
perform arithmetic operations on them.
Some transformations can be applied on the four different types of attributes
● In the case of nominal attributes we can remap the values one-to-one.
Ex: Employee IDs can be remapped/reassigned without causing any
difference.

● In the case of ordinal attributes we can perform any transformation that

preserves the order of the values, i.e., new_value = f(old_value) for an
order-preserving function f.
Ex: if the old values are {1, 2, 3}, we can transform them to {0.5, 1, 10}

● In the case of Interval attributes we can multiply and add


constants to old values to get new values.
Ex:- New value=a*old value + b

● In case of Ratio attribute we can perform arithmetic transformation


Ex:- New value= a*old value
Ex:- Length can be transformed from meter to feet or vice versa
Describing Attributes by the Number of Values :-

Attributes can also be distinguished by the number of values they take.

Discrete: A discrete attribute either takes finite values or countably infinite


values i.e. the values are eventually countable.
Ex:- Employee ID, Pincode

Binary attributes:-These are special types of discrete attributes that take


only two values 0/1 or true/false. We use boolean for such attributes.

Continuous attributes:- Continuous attribute takes real numbers ( infinite


+ uncountable)
Ex:- temperature, height, weight etc

Continuous attributes are float variables


Asymmetric Attributes

Attributes where only the non-zero values are important are called asymmetric

attributes

In case of such attributes we should focus only on non zero values and ignore

attributes that have zero value

Ex:- A record of a student will have ‘1’ for a course that a student has taken and

‘0’ for a course that he or she has not taken. If you concentrate on ‘0’ then we

will be looking at a lot of attributes with values ‘0’ as students take only a few

courses. In such cases we should concentrate on attributes that have ‘1’ for the

course.
Types of Data sets:-

There are three types of data sets:-

● Record data

● Graph Based data

● Ordered data

Before we look at each one of these let us look at some general

characteristics that all these three types of data sets have

a) Dimensionality

b) Sparsity

c) Resolution
a) Dimensionality
● Dimensionality means number of attributes i.e. number of columns in a data
set.
● Higher dimensionality, that is more columns, is not preferred. So we generally
apply dimensionality reduction techniques to reduce dimensions(i.e. columns)
b) Sparsity
● Sparsity means that most attribute values of the data objects are zero, so we can
concentrate only on the non-zero values.
● By storing only the non-zero values of sparse data we save space and computation time.
c) Resolution
● It is important to decide at what resolution the database should be stored.
● Ex:- Surface of earth will look uneven if we look at Earth at metre level
resolution but if you look at Earth at km level resolution Earth looks smooth.
● Ex:- If we look at the weather pattern at the granularity of 'hours’ then we get
details on Storms etc but if you look at granularity of ‘months’ we might miss
them.
Dataset Type1

a)Dataset Type 1:- Record data


● Most data sets are usually of this type, i.e., record data.
● Each record has a fixed set of attributes.
● Each record is independent and usually has no relation with another
record.
● Record data is usually stored in flat files or in relational databases.

● There are many types within


record data:
1.Transaction data
2. Data matrix
3. Document term
matrix ( Sparse data matrix)
Let us look at these three types of
record data.
Transaction Data(Market Based Data)
● Each record is a transaction that stores a set of items.
Ex: each transaction in a grocery store will have the set of items bought by a
customer.
● Usually in such a transaction record, the individual products that were purchased
are the items, and they mostly have binary values, i.e., an item is marked
0 if it was not bought and 1 if it was bought.
● We can also keep a "count" for each item, i.e., the number of each item bought.
Ex: 4 shirts, 3 chains, etc.
The figure shows a transaction data with transaction records
The Data Matrix

● If every row of the data set has the same fixed set of attributes and all attributes
take real/integer values, then we call such data a data matrix.
● Such a data matrix is an m × n matrix with m rows, one for each
object, and n columns, one for each attribute.
● An example is the Iris flower data set, where each row is one
sample flower with four columns (sepal length, sepal width, petal
length, petal width).
Such a data set is an m × 4 matrix, i.e., m rows for m different
flower samples, each row with 4 columns.
The Sparse Data Matrix

● This is a special type of data matrix in which most attribute values
are '0'.
● The transaction data that we saw in 1.1 is an example of a sparse
data matrix.
● The document-term matrix is another example. In the
document-term matrix we go through m documents and list
the words found in these documents. All documents will not
contain all words, so we mark '0' for words not found in a
document and record the count of occurrences for each word
that is found.
Data set Type 2:- Graph Based data
● This is the second type of dataset. Sometimes it is convenient to
show data in the form of a graph. There are two types of graph
based data.
2.1. Data with relationship among objects
● Objects(i.e rows) may be related to each other. Such objects are
shown as graphs.
● Nodes- objects become the nodes of graph
● Links- Relation between objects are shown as links

The Figure shows an example for the World Wide Web, i.e., search engines
create a graph of web pages in which the web pages become the nodes and
the hyperlinks that let users browse from one page to another are
used as the links connecting the web pages.
Data with objects that are graphs

● This is the second type of graph based data.


● Each object can be made up of sub objects and these sub
objects can be linked with each other.
● In such a case each object can be shown as a graph.
● Ex: Chemical compounds are made of atoms that have ionic
and covalent bonds. So the atoms are shown as nodes and the
chemical bonds are shown as links.
(c) Data set type 3: Ordered data

● This is the third type of data set.


● There can be relationships among the attributes
(i.e., columns) that involve order in time or space, so the
records may need to be kept in a particular sequence. Such data
sets are called ordered datasets.
● We will look at four sub topics under order data:
(a) sequential
(b) sequence
(c) time series
(d) spatial
Sequential data: (Temporal data) :
● Each transaction can have a time associated with it.
● Using the customer identity and the time information we can link
the transactions of a customer and try to find buying patterns.
● Ex: People who got DVD players also bought DVDs within a few
hours/days.
● The Figure shows how a bunch of transactions from customer
C1, C2 and C3 are grouped and sequenced on a per-customer
basis.
Sequence Data
● Sequence data has a sequence of entities like a sequence of
words or letters.
● It looks like sequential data(of 3.1) but unlike sequential data, we
don't have time stamps associated with sequence data.
● Ex: a sequence of the A, T, G and C nucleotides that make up a
gene in DNA.
● In this case we store a sequence of A,T,G and C and analyse
them to identify genetic defects.
Time series data
● This is a type of ordered data where each record is a time series so
each row (i.e. record) of data has data that was recorded at a
different time. Read the statement carefully again.
● Each row contains a set of values that are recorded at different
times.
● Ex: Financial data(stock price) :
● Each row in financial data is for stock price for the day so each row
has a 100 columns
● and each column records the stock price at a specific time for that
day.
● So each row becomes a time series for that day.
Temporal correlation

● There is a specific property that we see as a part of time series data; it is

called temporal correlation. This means that if we take
measurements that are close in time, then the values of these
measurements will also be close.
● Ex: Share price at 9:01 a.m. 9:02 a.m. would be very close to say
rupees 310.5 and rupees 310.55, so there is temporal correlation
between the share prices.
Spatial Data
● Some data have connections to space (i.e. place) where we take
measurements.
● Ex: The weather i.e. rainfall temperature pressure depends on which
location we measure these parameters.
● Such data is called spatial data.
Spatial correlation:
Measurements taken at close distance will have similar measurement reading.
● Ex: If we measure temperature at 2 points that are 50 M apart then
temperature is usually the same. This is called spatial correlation.
● So what we are saying is that spatial data is a type of ordered data.
Data Quality

The data that we collect for data mining can have issues with the quality of data.
We will look at some aspects of data quality here.
Measurement & Data Collection Issues
We can have issues while
measuring the values
collecting the data
Measurement Errors
Any problems arising from the measurement process
✓ ie value recorded differs from the true value
✓ The difference between recorded value & true value is measurement error.
Data Collection Error
✓ Errors introduced during data collection process
✓ Ex: the person recording the data missed typing a few values for a few rows in the
dataset. So either the objects (rows) are missing/duplicated or the features
(columns) have missing values.
a) Noise and Artifacts
When we are measuring the values, a random component may get added
to the measurement and hence distort the values.
Fig shows how the time series data got distorted because noise got
introduced.

Similarly next Fig shows noise points (i.e '+') that got added to spatial
data. Generally noise gets added to temporal(ie time series data like a
sound wave) or spatial data (like temperature at different places).
We can use many signal processing and image processing techniques to
reduce the noise.
Quantifying the Error: Precision, Bias, Accuracy

How do we quantify the error in measured values v/s actual values? We


use three terms to do that
• precision
• bias
• accuracy
Precision: Definition
The closeness of repeated measurements (of the same quantity)
to one another.
i.e., if we measure, say, 5 times, how close the values are to each other.
Precision is measured by the standard deviation of the repeated measurements.
Bias: Definition: A systematic variation of measurements from the
quantity being measured.
Take, say 5 measurements and calculate its mean.
Find the difference between this 'mean' and the expected value.
Example for precision and bias:
If we measured the mass five times and we got the
following values :
{1.015, 0.990, 1.013, 1.001, 0.986}.
Now, The mean of these values is 1.001kg
If actual mass was 1kg then the
Bias = |mean-actual value|
Bias = |1.001-1.0|
Bias = 0.001
Precision = Standard deviation = 0.013
Definition (Accuracy). The closeness of measurements to the true
value of the quantity being measured.
i.e. how close are we to the true value.
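A quick check of the worked example above (the five mass readings) using numpy:

import numpy as np

readings = np.array([1.015, 0.990, 1.013, 1.001, 0.986])
true_value = 1.0                                  # the actual mass in kg

mean = readings.mean()                            # 1.001
bias = abs(mean - true_value)                     # 0.001
precision = readings.std(ddof=1)                  # sample standard deviation ≈ 0.013
print(mean, bias, round(precision, 3))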
Outliers

Any anomalous values is called an outlier.


If a data object i.e. a row in a data set has values that seem to be
very different from all other data objects in the dataset then it is
called an outlier.
Or for a single data object one particular attribute ( i.e. one column
) has an unusual value even then it is an outlier.
Ex: We were measuring IQ levels and we had all values in a range of
say 110 to 120 but one IQ value came out 160 then it is an outlier.
Outliers are different from noise. Noise is always spurious, but
outliers can be legitimate values and may be of interest to us.
In fraud detection models we generally look for outliers.
So outliers are a part of data quality issue .
Missing Values

Missing value is the third type of data quality issue w.r.t


measurement/data error.
We could be missing one or more attribute values because
the information was not collected (ex: people declined
to give their age, or the field was not applicable to some
users).
So how do we deal with missing values?
There are 3 ways to deal with this :
C.1) Eliminate data objects/attributes :
If few rows (i.e. objects) have missing data then simply
remove these objects.
If an attribute is missing most of the data then eliminate
the attribute (i.e. remove the column from dataset).
C.2) Estimate missing value :
Sometimes we have continuous value based attributes (like
temperature humidity) and we may be missing few values.
In such a case instead of removing the row due to missing values,
we could estimate/ guesstimate the value and enter/update these
missing values.
We usually do this for continuous values.
For a discrete value-based attribute we could put the most
commonly occurring value into these empty slots.
C.3) Ignore the missing values during analysis :
Sometimes in certain use cases like clustering we can ignore the
missing attribute values and still perform an analysis.
If it is possible to ignore them, ignore the missing values and continue.
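A small pandas sketch of the three strategies above (the column names and values are illustrative only):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "temperature": [36.6, np.nan, 37.1, 36.9],     # continuous attribute with a gap
    "city": ["Delhi", "Pune", None, "Delhi"],      # discrete attribute with a gap
})

# C.1) Eliminate data objects/attributes with missing values
dropped = df.dropna()                              # remove rows that have any missing value

# C.2) Estimate the missing values
df["temperature"] = df["temperature"].fillna(df["temperature"].mean())  # continuous: use the mean
df["city"] = df["city"].fillna(df["city"].mode()[0])                    # discrete: most common value

# C.3) Ignore missing values during analysis (many pandas aggregations skip NaN by default)
print(dropped)
print(df)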
d) Inconsistent values

This is the fourth type of data quality issue w.r.t


measurement/data error.
Sometimes the data can be inconsistent for
example, if an address has pin code and city these
two values may not match in reality. This might
happen if the data entry person wrongly typed the
pin code or the scanning software wrongly
recognise the numbers etc.
Some inconsistencies can be detected automatically (like the height
of a person being a negative value); others need
expert opinion.
e) Duplicate Data

This is the fifth type of data quality issue w.r.t measurement/data


error.
A data set can contain data objects (i.e., rows) that are
duplicates. In a case where two rows actually represent a
single object, they should be combined.
However, sometimes two objects (rows) look like duplicates but
are not duplicates (e.g., two people with an identical name); during
"deduplication" care must be taken so that such entries are not
wrongly merged into one.
We are done with the topic on measurement and data collection
issues we looked at five types of issues.
Next we look at another topic that affects data quality that is
related to applications.
Issues Related to Applications

Quality of data can suddenly change based on the type of application


where this data is generated and used. Let us look at some aspects
related to this.
Timeliness : certain types of data, for example e-commerce data,
become outdated very quickly. So their value (i.e., quality) lasts for a very
short amount of time, and the models generated using this data also lose
value quickly.
Relevance : The data available with us should contain all the necessary
and important fields. Else it becomes low quality data.
For example, if we wish to predict the accident rate of drivers using a
dataset that does not contain the age of drivers or gender then the data
quality is very poor and not helpful.
Another problem is sampling bias. Suppose we sample only one type of
data and try to model, then it will be skewed towards the data used in
modelling the system.
Knowledge about the Data : Dataset generally comes
with documentation. If this documentation is poor, then
we might misinterpret the values in the data.
For example if the creators of data have used – 9999 as
replacement for missing fields and the documentation
does not mention this then we will use this data without
knowing this and our model will turn out to be very poor.
Sometimes the documentation does not tell us which
attribute is of type nominal or ordinal or interval or ratio.
Sometimes the units of data are missing (ex : meters or
feet for length) all these will result in creating a model
that will be erroneous.
Data Pre-processing
There are 7 data pre-processing topics :
• Aggregation
• Sampling
• Dimensionality reduction
• Feature subset selection
• Feature creation
• Discretization and binarization
• Variable transformation
In short these 7 pre-processing steps help in choosing two things i.e.
✓ Which data objects (i.e. Rows) should be selected for data mining.
✓ Which attribute (i.e. Column) should be selected for data mining.
Aggregation
Assume that our dataset contains hundreds and thousands
of rows (i.e. Objects). We want to reduce the number of
objects by combining many objects (i.e. Rows).
So, aggregation is the process of combining two or more
objects (i.e. Rows).
Example : Assume that we have a dataset of sales data of all
customers from all cities for all stores in US.
One way to aggregate can be to combine all the sales data
of a single store into just one row. So all sales of one store
get merged into one row.
This way we get one aggregated row per store,
and the number of rows will now equal the number of stores
(i.e., if the company has 100 stores, then the dataset will now
have 100 rows).
Question: How do we club the columns now?
Answer:
Quantitative data: if a column contains quantitative data (ex: sale amount),
then while clubbing the rows we can simply sum up (or average) the values.
Qualitative Attribute : But in case of qualitative data like item name (TV, bike,
fridge etc) we could simply ignore this column.
Question: What is the motivation for doing this aggregation activity ?
Answer: There are many reasons for this:
a) Memory : clubbing objects/rows saves storage space and processing time.
b) Change in scope : by clubbing the data we get a high-level view of the data.
Ex : by clubbing sales on a per-store basis we now get store-level sales
information.
c) Data stability : individual entries show more variability, but aggregated data provides a
more stable, consistent view.
Question: Any disadvantage of aggregation?
Answer: We could lose potential details on micro aspects. For example, if
we aggregate sales on a per month basis, then we lose information on
which day of the week had highest sales.
Example :
Figure 2.8 shows histogram of rainfall in Australia.
Figure 2.8(a) is for monthly rainfall while Figure 2.8(b) is for yearly rainfall
for the same locations.
We see that the monthly data has higher standard deviation
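A small pandas sketch of the store-level aggregation described above (store names and amounts are made up): the quantitative amount column is summed per store, while the qualitative item column is simply dropped.

import pandas as pd

sales = pd.DataFrame({
    "store": ["S1", "S1", "S2", "S2", "S2"],
    "item":  ["TV", "Fridge", "TV", "Bike", "TV"],
    "amount": [400, 900, 350, 700, 420],
})

# One aggregated row per store; the qualitative "item" column is ignored
per_store = sales.groupby("store", as_index=False)["amount"].sum()
print(per_store)          # S1 -> 1300, S2 -> 1470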
Sampling

This is a second type of data preprocessing


Sampling means choosing a subset of objects (or rows) from the
data set so that a better but more expensive data mining
algorithm can be comfortably run on this subset of data.

If we try to run good algorithms on all of the data, it
becomes expensive with respect to time and space; hence we
have to choose a subset of the data.

Samples should be representative: You should choose


samples from the data set such that it is representative of the
entire data set

ie. sample should cover all the types or variants of information


from the original data set
There are Three approaches to sampling data as
shown below:

Sampling Approach
1 Random sampling
sampling without replacement
sampling with replacement
2 Stratified sampling
equal objects drawn
proportional objects drawn
3 Progressive sampling
Random Sampling

Random sampling is a process where we randomly pick the objects (ie


rows) from the data set

There are 2 ways to randomly sample

Sampling without replacement: in this approach we sample an object (i.e.,
row) and then remove it from the data set.

This ensures that the same row does not get picked again when we
draw the next sample.

Sampling with replacement: in this approach we sample an object (i.e., row)
and then we do not remove it from the data set,

so there is a chance that the same row gets picked more than once during the
sampling process.
Stratified sampling

If there are n types of objects in the data set then in stratified


sampling we make sure that at least one sample from each type
of object is present in the sample data

example If we have a data set of images of Cat Dogs cows


lions and monkeys when we sample say 10 images we make
sure that there is at least one image of each animal in the
sample set

There are two ways of doing stratified sampling

1. equal objects drawn In this approach we ensure that we


pick same number of samples of cat dog cow (ie say 2 cat
images ,2 dog images, 2 cow images etc)
Proportional objects drawn
In this approach we draw objects (rows) based on the
proportion of each type of object in the original data set.
If we have 100 images, of which 50 are dogs, 40 are cats and 10
are cows, and we have to sample only 10 images, then we
proportionally pick five dogs, four cats and one cow.
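A short pandas sketch of random sampling (with and without replacement) and of proportional stratified sampling on a made-up image table; the 50/40/10 split mirrors the example above.

import pandas as pd

images = pd.DataFrame({
    "animal": ["dog"] * 50 + ["cat"] * 40 + ["cow"] * 10,
    "id": range(100),
})

# Random sampling without replacement: a row can be picked at most once
without_repl = images.sample(n=10, replace=False, random_state=0)

# Random sampling with replacement: the same row may appear more than once
with_repl = images.sample(n=10, replace=True, random_state=0)

# Stratified, proportional sampling: 10% from each animal type -> 5 dogs, 4 cats, 1 cow
proportional = images.groupby("animal", group_keys=False).apply(
    lambda g: g.sample(frac=0.1, random_state=0)
)
print(proportional["animal"].value_counts())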
Progressive sampling
In this method we start with a small sample size(say 10 samples) and then we
progressively add more samples to this sample set based on the need

This method is used in predictive modelling

Example: assume that we have written an ML algorithm to detect whether an animal is a

dog or a cat. We start with a small sample set to train the model; obviously its
accuracy will be poor. As we add more samples to the sample set the performance
increases. But after a point, adding more samples to the sample set
does not improve the performance, so we stop adding more rows at that point.
Note: sampling results in some loss of information.
Dimensionality Reduction

This is a third step in data preprocessing

In the sampling step we reduced the number of rows. In this step of

dimensionality reduction we are interested in reducing the number of
columns, i.e., attributes (or features).

When we create a dataset we may have thousands of columns in the

data. But running an ML algorithm with this many columns/features
is undesirable.

In this topic we look at reducing the number of columns by


merging/clubbing the columns

In the next topic(ie feature selection) we look at reducing column


count by picking /choosing fewer columns

So this topic is not related to picking and choosing columns


Advantages of Dimensionality Reduction

(a) Many data mining algorithms work best when the number of
column (dimensions) are less.

(b) By reducing the dimensions we can understand our model


better as it has fewer attributes.

(c) With fewer dimensions we can visualise the features better

(d) Fewer dimensions means saving storage space and reduce


access time
The Curse of dimensionality

We come across this term very frequently when learning

ML algorithms.

As dimensionality increases data analysis becomes harder


because

✓ Data becomes sparse ie we will have many columns where


data will be Null or empty. So analysis becomes difficult

✓ Classification: in the case of classification problems this results in

a model that does not classify reliably.

✓ Clustering: In the case of clustering problems the clustering of


points become less meaningful and hence clustering will be
poor
Techniques to reduce the dimensionality

So our goal is to reduce dimensionality by using linear algebra based


techniques
There are two important ways to reduce dimensionality
(a) Principal Component Analysis (PCA)
(b) Singular Value Decomposition (SVD)
Principal Component Analysis
In this linear algebra based method we derive new attributes from the existing
attributes. These new attributes are called principal components, and they
satisfy three things:
✓ They are linear combinations of the original attributes
✓ They are orthogonal (i.e., perpendicular, i.e., dissimilar) to each other
✓ They capture the maximum amount of variation in the data. If the data in two
columns are similar, or one can be derived from the other, then one of
those directions is effectively ignored.
Singular Value Decomposition(SVD)
This is related to PCA and is another linear algebra based technique
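A hedged scikit-learn sketch of PCA: a made-up dataset with one nearly redundant column is projected onto 2 principal components, and most of the variation is captured by the first component.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
data = np.hstack([
    x,
    2 * x + rng.normal(scale=0.1, size=(200, 1)),   # almost a copy of the first column
    rng.normal(size=(200, 1)),                      # an independent column
])

pca = PCA(n_components=2)               # keep 2 of the 3 original dimensions
reduced = pca.fit_transform(data)
print(reduced.shape)                    # (200, 2)
print(pca.explained_variance_ratio_)    # first component explains most of the variance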
Feature subset selection
• This is the fourth step in data preprocessing

• In this step we try to remove features (i.e., columns) that are not helpful. There
are two ad hoc and three systematic approaches for feature selection.

(a) Remove Redundant features: In this approach we remove those features


that are redundant
Example: if one column has the purchase price, then we don't need the column for
sales tax, as the sales tax is directly proportional to the purchase price. We can retain
only one of these columns.
(b) Remove irrelevant features: some features don't add any value. For example,
employee IDs or roll numbers of students do not help in any way for data mining, so
remove such columns from the data set.
Other than these two methods there are three standard approaches for features
selection embedded, filter, wrapper
(c) Embedded approach :Sometimes feature selection automatically happens
as a part of data mining ex Decision tree classifiers inherently perform feature
selection
(d) Filter approach: feature selection is done independently
of the data mining task, e.g., choose attributes (i.e., columns)
that have very little correlation between them.

(e) Wrapper Approach :Use the data mining algorithm as a


black box to find a subset of attributes.
We will look at filter and wrapper approach in Greater
detail below

Architecture for Feature Subset


Selection
The figure shows an architecture/flowchart
of how feature selection can be done.
(a) We start with the whole set of attributes.

(b) Search strategy: we apply a search strategy to identify a

subset of features. The strategy used to find the subset should be
computationally inexpensive and it should find a
near-optimal subset of attributes.

(c) Evaluation: once we have identified the subset, we then

pass the subset of attributes through an evaluation procedure
to check how good this subset is for the data mining task
(for example a classification or clustering task).
(d) Stopping criteria: we can potentially create millions
of subsets of attributes. If we keep evaluating each one
of these, we will never complete the feature selection
process, so we need some stopping criteria to tell
ourselves that we have found a good subset of
attributes. Some stopping criteria are:

✓ Stop after a certain number X of iterations

✓ Stop if the evaluation result crosses some threshold

✓ Stop if the size of the attribute subset has reached a

certain size
✓ Stop if we have met both the size and evaluation criteria
Validation Procedure

● Once we have finally shortlisted the subset of attributes we need


to validate if this final set is good enough.
● The best way to evaluate is as follows:
1. First run the mining algorithm using all the features/attributes
& measure performance.
2. Then use the chosen subset of attributes and run the mining
algorithm again. If the performance of this run matches the
performance with the complete attribute set, then we have shortlisted a
good subset.

● Another method to evaluate is as follows:


1. Run different feature selection algorithms. each one will give
us its own subset of features.
2. Run each of these subsets on the data mining algorithms &
measure performance for each subset.
3. Choose the subset that gets the best score.
Feature Weighting

● This is another feature selection approach where each feature /


attribute is given a weightage.
● Important features get higher weightage whereas less important
features are given lower weightage.
● Hence , when the data mining task is executed ,important
features contribute more to the mining task.
Feature Creation

● This is the 5th step of data preprocessing.


● Features creation means creating new features / attributes from
existing attributes.
● There are three methods of creating new attributes (i.e new
column of the dataset)
1. Feature Extraction
2. Mapping data to a new space
3. Feature construction
(a) Feature Extraction

● Feature extraction means extracting features from raw


data.
● Example: Suppose we have pixel information as raw
data, now each pixel acts like an attribute or column in
a data set.
1. We can club many pixels together to form edges /
shapes.
2. If our aim is to recognize human faces then only a few
of these clubbed pixels can help in constructing a
human face.
3. So select only those groups of pixels that aid in the
creation of the face.

● Feature extraction process is very specific to the


application/ domain.
(b) Mapping data to a new space

● We can get important and interesting features from data if we


change our view of the data.
● For example : look at the figure below, this looks like a circle.

But what if I say that this is actually a cone which you are
viewing from top.
A cone’s top view looks like a circle.
Fourier transform
1. Generally an electromagnetic signal in the time
domain looks like, say, a sine wave.
2. When we analyse the signal in the time domain we
may not notice any issues or anomalies.
3. However, when we use the Fourier transform and convert
the time-domain signal to a frequency-domain signal, then we
might observe the noise in the signal.

Fourier Analysis

● This figure is obtained when 3 signals are combined together.


● Out of these 3 signals, two signals are valid signals. one of 7Hz and
another of 17Hz. The 3rd signal is actually a noise.
● Now looking at Fig 2.12(b) we cannot make out that the combined
signal actually contains noise.
● But, suppose we take this combined signal and perform fourier
transformation then we get the figure 2.12(c).
● In Fig 2.12(c) we can clearly see two peaks. Each peak is for one
signal: the 7Hz signal has a peak and the 17Hz signal also has a peak.

The rest are all smaller components that are the noise.


● So using the fourier transformation of the original signal we would get
useful information.
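A sketch of this Fourier example with numpy (the sampling rate and noise level are assumptions): two sinusoids at 7 Hz and 17 Hz plus random noise are combined, and the FFT of the combined signal shows two clear peaks at those frequencies.

import numpy as np

fs = 1000                                    # sampling rate in samples per second (assumed)
t = np.arange(0, 1, 1 / fs)
signal = (np.sin(2 * np.pi * 7 * t)          # valid 7 Hz signal
          + np.sin(2 * np.pi * 17 * t)       # valid 17 Hz signal
          + 0.5 * np.random.randn(t.size))   # additive noise

spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(t.size, d=1 / fs)

# The two largest peaks in the spectrum sit at (approximately) 7 Hz and 17 Hz
top_two = sorted(freqs[np.argsort(spectrum)[-2:]])
print(top_two)                               # roughly [7.0, 17.0]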
(c) Feature construction

● This is the 3rd method of feature creation.


● In the original data , we may have all information but it is not in
usable form.
So we construct features out of this data.

● Example:
We have a dataset of historical excavated items. For each item we
have its mass and volume, but we want to find the material from
which the item was made.
● One way to do it is to compute mass/volume to get the density. The
value of the density gives us an idea of the material used to make the item.
Discretization and Binarization

● This is the 6th step of data preprocessing.


● Two important and interesting aspects of data processing are
discretization and binarization:-
Discretization: The process of converting a continuous quantity
into a discrete quantity is called discretization .This step is
necessary for classification problems.
Binarization: The process of converting either continuous or
discrete values into binary data is called binarization.
Let us look at both these topics.
Binarization:
● As the name suggests , binarization is to convert features
into binary values .This is a multi step process.
Step 1:For each of the categorical values, assign an integer
value,ex

Step 2:Convert each of these integer values to binary number as shown


below.,ex

● Now this x1,x2,x3 could have been the column that could have
replaced the categorical column inside our data set.
● But there is a problem: note that for both "good" and "OK" the
column "x2" is 1, so we have accidentally added a correlation between two
columns/attributes.
● So even though this kind of helps, it is not the final solution.
Step 3:Instead , convert each integer value to asymmetric
binary attributes.

These ‘5’ columns can now be added to the dataset and we can remove
the column “categorical values” from the dataset .In essence we
binarized the categorical value attribute.
Example 2:A binary attribute can also be replaced with two asymmetric
binary attributes
For example Gender male & female is replaced as below
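Since the example tables appear only as figures in the slides, here is a pandas sketch of the same idea with illustrative values (awful/poor/OK/good/great for the categorical attribute, male/female for the binary attribute):

import pandas as pd

df = pd.DataFrame({
    "quality": ["awful", "poor", "OK", "good", "great"],
    "gender": ["male", "female", "female", "male", "male"],
})

# Step 1: categorical value -> integer (awful=0, poor=1, OK=2, good=3, great=4)
mapping = {"awful": 0, "poor": 1, "OK": 2, "good": 3, "great": 4}
df["quality_int"] = df["quality"].map(mapping)

# Step 3: one asymmetric binary attribute per categorical value (one-hot encoding)
quality_onehot = pd.get_dummies(df["quality"], prefix="x").astype(int)

# Example 2: a binary attribute replaced by two asymmetric binary attributes
gender_onehot = pd.get_dummies(df["gender"]).astype(int)   # columns: female, male

print(pd.concat([df["quality"], quality_onehot, gender_onehot], axis=1))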
Discretization of Continuous Attributes

● Some attributes like temperature , pressure etc are continuous


values Ex: 98.46 degree C, 98.694 degree C,99.012 degree C
etc.You see that these temperature values are not discrete
values(like 90,91,92 etc).
● But data mining often cannot be performed directly on such continuous
values. These values need to be converted to discrete values. The
process of converting continuous values to discrete values is a two-
step process.

Step 1:Sort the values and divide them into ‘n’ intervals by specifying
n-1 split points.
Ex: if we have values like 5.4, 12.9, 6.8, 25.4, 16.5, then we specify 3
intervals with two split points, i.e., 10 and 20. In this scenario 5.4 & 6.8 will
be to the left of 10, whereas 12.9 & 16.5 will be between 10 & 20, and
finally 25.4 will be to the right of 20.
Step 2:Round off these values to a pre agreed value within the
interval.
Ex:5.4 & 6.8 may get rounded to say ‘5’. 12.9 & 16.5 may get rounded
to say 15.

● Challenge:- So the challenge in case of discretization is to specify /


figure out
(a) How many split points to choose.
(b) Where to place these split points.

There are two ways to perform discretization:

(1) Unsupervised discretization
(2) Supervised discretization
Unsupervised Discretization

Unsupervised discretization does not take the class information into


consideration while splitting the data into different groups.

Unsupervised discretization can be further split into 3 types.


(a) Equal width
(b) Equal frequency
(c) K – means
The figure shows all these 3 methods.
Figure 2.13 shows all these 3 methods. In this Figure 2.13(a) shows the
values of each of these continuous attributes. i.e., some values are around
5, some around 7.5, some of them around 10 and some around 15.
✓ Equal Width: If we decide to use equal width discretization then we
will divide our values into four intervals i.e. (0 to 5), (5 to 10), (10 to
15) and (15 to 20).
All values between 0 to 5 will then be rounded to a single value.
Similarly, all values between 5 to 10 would get rounded off to some
value and so on.

✓ Equal Frequency: In this approach we draw the split lines such that
each interval has same number of points. In Figure 2.13(c) we draw
the lines such that if (0 to 5.5) interval has 120 points then the next
group say (5.5 to 9) will also have 120 points and next group (9 to
14.8) will also have 120 points and so on.

✓ K-Means: This is a type of clustering technique. Figure 2.13(d)


shows the result of k-means (we will learn k-means later).
Out of the 3 techniques, k-means seems to provide better results.
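A pandas sketch of the first two unsupervised approaches; the values are generated around 5, 7.5, 10 and 15 to mimic Figure 2.13, cut gives equal-width bins, and qcut gives equal-frequency bins.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
values = pd.Series(np.concatenate([
    rng.normal(5, 0.5, 120),      # values around 5
    rng.normal(7.5, 0.5, 120),    # values around 7.5
    rng.normal(10, 0.5, 120),     # values around 10
    rng.normal(15, 0.5, 120),     # values around 15
]))

equal_width = pd.cut(values, bins=4)    # 4 intervals of the same width
equal_freq = pd.qcut(values, q=4)       # 4 intervals with the same number of points

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())   # each bin holds ~120 points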
Supervised Discretization

✓ Unsupervised discretization does not take the class labels into consideration
when it performs discretization.
✓ Supervised discretization takes the class information during discretization. It
works as follows:
➢ Let k be the number of different class labels (cat, dog, cow, ...).
➢ Let m_i be the number of values present in the ith interval of a partition.
➢ Let m_ij be the number of values of class j that are present in the ith
interval.
➢ Then the entropy e_i of the ith interval is given by the equation:

e_i = − Σ_{j=1..k} p_ij log₂(p_ij), where p_ij = m_ij / m_i

is the fraction (probability) of class j in the ith interval.
If an interval i contains only values of a single class, then its entropy e_i = 0
(the interval is perfectly pure).
Categorical Attributes with Too Many Values

✓ Sometimes the categorical attribute can take values from too


many choices.
✓ Example: Assume that there is a university with 100 departments.
If we have a dataset of students and we have a column called
department then a cell for a student can take one out of 100
department names. For a categorical attribute having so many
possible values is undesirable.

One way to resolve this is to club many categorical values into a


new categorical value.
Example: CSE, ISE, ECE, ME etc can all be clubbed into just one
new name called engineering.
Variable Transformation

• This is the 7th and final data preprocessing step/approach.


Variable Transformation means changing the value of attributes. We will learn
more below.
There are two types of variable transformation:
a) Using Simple Functions
b) Normalization or Standardization
Let us look at both the methods.

(a) Using Simple Functions


✓ This method is pretty straight forward:
✓ Take the values from the dataset table
✓ Apply one of the functions like x^k, log x, e^x, √x, 1/x, sin x, or |x| to this
value.
✓ Whatever answer you get, put the answer in place of original value in the
table.
✓ So the table now contains new values obtained after applying the function.
Note: If you apply this function then it should be applied to the specific column
for all the rows.This method should be applied carefully because it can alter
the table values in unexpected ways. For example if the column has values
{1,2,3} and if you apply 1/x then the values get reciprocal values {1,½,⅓}.
Normalization or Standardization

▪ This method of variable transformation is generally used in many AI,ML


models. This is important
Many ML algorithms have a fascination for columns that contain large value
numbers.
For example, Consider this dataset
Employee Name Age Salary

John 27 1,20,000

Rita 31 1,65,000

Nancy 33 1,90,000

If we pass these values as is to an ML algorithm, it notices that the salary column


has big numbers whereas the “age” column has smaller numbers. So it thinks
that salary is more important than age and hence it ignores “age” information.
Now this is not correct, “age” column has its own importance and salary has its
own importance. So how can we make the ML algorithm to treat both columns
with equal importance?
One way is to standardize (normalize) both columns, i.e., convert the values to
numbers on a common small scale centred around a mean of 0. How do we do that?
a) Let x̄ be the mean (i.e., average) of the column values.
b) Let s_x be the standard deviation.
c) Then each value x in the column is standardized (or normalized)
using the formula x′ = (x − x̄) / s_x.
The standardized values are centred around 0 (typically lying roughly between −1 and 1),
so our above table could look as
Employee Name Age Salary

John 0.3 0.4

Rita 0.35 0.55

Nancy 0.38 0.6

Since values of age and salary are now in same range, both columns will be treated
with equal priority.
Note: our data can have outliers that might skew the mean (x̄) and the standard
deviation (s_x). To avoid this, sometimes instead of the mean we use the median (i.e.,
the middle value), and the standard deviation is replaced by the absolute standard
deviation, σ_A = Σ_i |x_i − µ|, where µ is the median.
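A small sketch of both variants on the age/salary table above (pandas applies the formulas column-wise; the numbers printed are simply whatever the formulas yield for these inputs):

import pandas as pd

df = pd.DataFrame({"age": [27, 31, 33], "salary": [120000, 165000, 190000]})

# Standardization: x' = (x - mean) / std, applied per column
standardized = (df - df.mean()) / df.std()

# Outlier-robust variant: centre on the median and divide by the absolute deviation from it
deviation = (df - df.median()).abs().sum()
robust = (df - df.median()) / deviation

print(standardized)   # age and salary are now on a comparable scale
print(robust)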
Measures of Similarity and Dissimilarity
Basics:
(a) Definitions:
✓ Similarity: Similarity between two objects (i.e rows) is a numerical measure
of the degree to which the two objects are alike.
✓ Similarity is often between 0 (not similar) to 1 (completely similar).
✓ Dissimilarity: Dissimilarity between two objects is a numerical measure of
the degree to which the two objects are different.
✓ If two objects are alike, the dissimilarities are lower
✓ We use distance as a synonym for dissimilarity.
(b) Transformation:
We generally apply transformation to do the following:
✓ Convert a similarity value to a dissimilarity, or vice versa
✓ Convert the values from one range (say(0 to 10)) to another range (say (0 to
1)).
✓ Example of the need for transformation: assume that we calculated proximity
values on the interval [1, 10]. Now we use a software package
from some other company that uses the range [0, 1], so we have to
transform the values from [1, 10] to [0, 1].
We can use the formula
s' = (s − 1)/9,
✓ where s is the similarity value between 1 and 10 and s' is the new value
between 0 and 1.
In general the formula for new similarities is:
s’ = (s−min_s)/(max_s−min_s) ,
Similarly,to convert dissimilarity values can be converted from any range to new
range [0,1] using the formula
d’ = (d − min_d)/(max_d − min_d)
Exception: What if the original values are between (0 to ∞). In this case we can
use the below formula:
d’ = d/(1 + d)
✓ Using this formula original values 0, 0.5, 2, 10, 100, and 1000 get converted to
0, 0.33, 0.67, 0.90, 0.99, and 0.999.
✓ So large values get closer to 1.
Converting dissimilarity to similarity:

✓ In the above example we only changed the range of either the
similarity or the dissimilarity values.
✓ In this topic we convert values from similarity to dissimilarity or
vice versa.
✓ Case 1: if similarity is in [0, 1], it is converted to
dissimilarity using the formula d = 1 − s.
✓ Conversely, a dissimilarity in the range [0, 1] is converted to
similarity using s = 1 − d.
✓ Case 2: Negation: if dissimilarity values are non-negative, we can define
similarity as the negative of the dissimilarity, s = −d.
✓ Example: if dissimilarity was 0, 1, 10, 100, then similarity will be
0, −1, −10, −100.
✓ Case 3: We can use another formula, s = 1/(d + 1).
✓ In this case, if d = 0, 1, 10, 100, then the corresponding
values will be s = 1, 0.5, 0.09, 0.01.
✓ Case 4: We can use another formula, s = e^(−d).
✓ In this case d = 0, 1, 10, 100 will get converted to s = 1.00,
0.37, 0.00, 0.00.
Case 5: Another formula can be
s = 1 − (d − min_d)/(max_d − min_d)
In this case d = 0, 1, 10, 100 will get converted to s = 1.00,
0.99, 0.90, 0.00.
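A quick sketch that evaluates these conversion formulas for the dissimilarities 0, 1, 10 and 100 used in the cases above:

import math

d_values = [0, 1, 10, 100]
min_d, max_d = min(d_values), max(d_values)

for d in d_values:
    s_neg = -d                                   # Case 2: negation
    s_inv = 1 / (d + 1)                          # Case 3: s = 1/(d + 1)
    s_exp = math.exp(-d)                         # Case 4: s = e^(-d)
    s_rng = 1 - (d - min_d) / (max_d - min_d)    # Case 5: rescale to [0, 1]
    print(d, s_neg, round(s_inv, 2), round(s_exp, 2), round(s_rng, 2))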
Similarity and Dissimilarity between Simple Attributes
✓ In order to understand proximity between two objects we have to find the
proximity for all columns of these two objects (i.e. rows).
✓ For simplicity, in this topic, we will try to understand proximity for
just one attribute (column).
We know there are 4 types of attributes (i.e. columns):
a) Nominal: These types of attributes have values that are just names.
So we say that nominal attributes are similar (s = 1, d = 0) if their values match,
else they are dissimilar (s = 0, d = 1). Look at row 1 of Table 2.7.
b) Ordinal: Ordinal attributes are those whose values can be arranged in
increasing or decreasing order. So, in the case of ordinal attributes we
map the values to integers between 0 and n−1 (where n is the number of levels),
and then for dissimilarity we calculate
d = |int_value_of_x − int_value_of_y| / (n − 1)
i.e. d = |x − y| / (n − 1)
where x, y are the two values being compared.
Ex: Quality of product = {poor, fair, ok, good, best}
we can map these to = {0, 1, 2, 3, 4}
Assume we have a dataset as below:

Product    Quality
p1         3
p2         2

Then we calculate dissimilarity d = |3 − 2| / 4 = 0.25.
For ordinal attributes we calculate similarity using the formula
s = 1 − d
so for the above example similarity s = 1 − 0.25 = 0.75.
c) Interval or Ratio: These attributes take numeric values. In this case we
calculate dissimilarity using the formula d = |x − y|,
and for similarity we can use any of the formulas like
s = −d, or
s = 1/(d + 1), or s = e^(−d), or
s = 1 − (d − min_d)/(max_d − min_d)
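The three rules can be coded up in a few lines; the sketch below mirrors the product-quality example, with the level mapping {poor: 0, fair: 1, ok: 2, good: 3, best: 4} taken from above.

```python
def nominal_dissimilarity(x, y):
    # Nominal: d = 0 if the values match, 1 otherwise (similarity is 1 - d)
    return 0 if x == y else 1

def ordinal_dissimilarity(x, y, n_levels):
    # Ordinal: values already mapped to 0..n-1, then d = |x - y| / (n - 1)
    return abs(x - y) / (n_levels - 1)

def interval_dissimilarity(x, y):
    # Interval/ratio: d = |x - y|
    return abs(x - y)

d = ordinal_dissimilarity(3, 2, n_levels=5)   # p1 = good(3), p2 = ok(2)
print(d, 1 - d)                               # 0.25 0.75
```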
Dissimilarities between Data Objects
In the previous section (2.4.2) we explained similarity/dissimilarity using just one
attribute.
In this section we will look at the similarity and dissimilarity of 2 objects (rows)
when they have two or more attributes (i.e. columns).
Let us take the example of 2 objects a and b.
Let both of these have 2 attributes as shown below:
a = (x1, y1)
b = (x2, y2)
Distance: The distance between two objects is one way to quantify similarity and
dissimilarity. So for the above case, we calculate the (Euclidean) distance as
d = √((x2 − x1)² + (y2 − y1)²)
In general, if we have n-dimensional data then the Euclidean distance between objects x and y is computed as
d(x, y) = √( Σ_{k=1..n} (x_k − y_k)² )
Euclidean distance can be further generalized to get an
equation called the Minkowski distance metric:
d(x, y) = ( Σ_{k=1..n} |x_k − y_k|^r )^(1/r)
✓ When r = 1, d(x, y) is the city block (Manhattan, taxicab) distance, also
called the L1 norm; the Hamming distance for binary vectors is a common example.
✓ When r = 2 we get back the Euclidean distance, also called the L2 norm.
✓ When r = ∞ it is called the supremum distance (L_max or L∞ norm)
and is defined as
d(x, y) = max_{k=1..n} |x_k − y_k|
Example: Table 2.8 shows the (x, y) coordinates of 4 points from
Figure 2.15,
i.e. p1 = (0, 2), p2 = (2, 0), p3 = (3, 1) and p4 = (5, 1).
Using the Minkowski formula:
If r = 1 we get Table 2.10 (i.e. use the Minkowski formula,
substitute r = 1 and calculate the distances between p1, p2, p3
and p4).
If r = 2 the Minkowski formula gives the Euclidean distances
between p1, p2, p3 and p4. This is captured in Table 2.9.
For r = ∞ the values are captured in Table 2.11.
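Since the textbook tables are not reproduced here, the following sketch recomputes the three distance matrices (r = 1, r = 2 and r = ∞) for the four points directly from the Minkowski formula.

```python
import numpy as np

points = np.array([[0, 2], [2, 0], [3, 1], [5, 1]])   # p1, p2, p3, p4

def minkowski(a, b, r):
    # Minkowski distance: r = 1 gives city block, r = 2 gives Euclidean
    return np.sum(np.abs(a - b) ** r) ** (1.0 / r)

def supremum(a, b):
    # L-infinity (supremum) distance: largest coordinate difference
    return np.max(np.abs(a - b))

n = len(points)
for r in (1, 2):
    matrix = [[round(minkowski(points[i], points[j], r), 2) for j in range(n)]
              for i in range(n)]
    print("r =", r, matrix)
print("r = inf", [[supremum(points[i], points[j]) for j in range(n)]
                  for i in range(n)])
```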
Properties of Euclidean distance
✓ If d(x, y) is the distance between two points x and y, then the following properties
hold:
Positivity:
a) d(x, y) >= 0 for all x and y
b) d(x, y) = 0 only when x = y
Symmetry:
d(x, y) = d(y, x) for all x and y
Triangle Inequality:
d(x, z) <= d(x, y) + d(y, z) for all points x, y and z.
Measures that satisfy all three of the properties listed above are called metrics.
Some dissimilarities do not satisfy all 3 properties, so these
dissimilarities cannot be called metrics. Let us look at one such example.
Example 1: Set difference:
In set theory, if ‘A’ is one set and ‘B’ is another set then the difference between
the two sets A and B is computed as:
A − B = the elements of A that are not in B
Ex: If A = {1, 2, 3, 4} and B = {2, 3, 4} then
A − B = {1}, whereas
B − A = { }, i.e. the empty set.
If the distance is taken as the size of the set difference, i.e. d(A, B) = size(A − B), then
d(A, B) = 1, whereas
d(B, A) = 0.
So this distance for sets does not satisfy the symmetry property, as we just saw.
It also does not satisfy the 3rd property, i.e. the triangle inequality.
So we can say that ‘set difference’ is not a ‘metric’, as it fails to satisfy all 3
properties.
Example 2: Time
Time is another case where the distance (i.e. the difference between two
time values) is not a metric. The difference between two times t1 and t2 can be
written as
d(t1, t2) = t2 − t1, if t1 ≤ t2
d(t1, t2) = 24 + (t2 − t1), if t1 > t2
Let us take some examples:
d(1PM, 2PM) = 1 hour
whereas,
d(2PM, 1PM) = 23 hours
(i.e. we go to the next day).
So clearly, the symmetry property does not hold
good.
Hence distance (i.e. the difference between two time
values) is not a metric, as the symmetry property
fails.
Similarities between Data Objects
In the previous section (i.e. 2.4.3) we tried to find
dissimilarities between two objects (i.e. rows) when
each object has 2 or more attributes.
This section looks at formulas for similarity between
two objects with 2 or more attributes.
It is quite straightforward: for example, if
x = (1, 2, 3, 4, 9, 25) is one object, then it is maximally similar to
another object y only when y = (1, 2, 3, 4, 9, 25).
In general, the properties that similarities
follow are:
1. s(x, y) = 1 only if x = y. (0 ≤ s ≤ 1)
2. s(x, y) = s(y, x) for all x and y. (Symmetry)
Examples of Proximity Measures
In this topic we take some specific examples of similarity and dissimilarity
measures.
a) Similarity Measures for Binary Data
Suppose we have two objects (i.e. rows) with only binary attributes (columns)
then how do we say they are similar?
So if x and y are 2 objects (rows) with n binary attributes (columns) then we get
4 quantities or possibilities.
f00 = the number of attributes (columns) where x = 0 and y = 0
f01 = the number of attributes (columns) where x = 0 and y = 1
f10 = the number of attributes (columns) where x = 1 and y = 0
f11 = the number of attributes (columns) where x = 1 and y = 1
Simple Matching Coefficient (SMC)
SMC computes the similarity using the following formula:
SMC = (f11 + f00) / (f00 + f01 + f10 + f11)
The formula simply says: count the number of columns where the values
match, then divide by the total number of columns in the dataset.
Jaccard Coefficient (J)
The Jaccard Coefficient is slightly different from the SMC. In
the case of the Jaccard Coefficient we count only the matches where both columns have a
‘1’, and we also ignore f00:
J = f11 / (f01 + f10 + f11)
Example: Calculate SMC and J for the following data:
x = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
y = (0, 0, 0, 0, 0, 0, 1, 0, 0, 1)
f00 = number of attributes where x = 0 and y = 0 = 7
f01 = number of attributes where x = 0 and y = 1 = 2
f10 = number of attributes where x = 1 and y = 0 = 1
f11 = number of attributes where x = 1 and y = 1 = 0
So SMC = (0 + 7) / (7 + 2 + 1 + 0) = 0.7 and J = 0 / (2 + 1 + 0) = 0.
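A small sketch that recomputes the four counts and both coefficients for the vectors x and y above.

```python
def smc_and_jaccard(x, y):
    # Count the four kinds of column agreements/disagreements
    f00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    f01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    f10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    smc = (f11 + f00) / (f00 + f01 + f10 + f11)
    jaccard = f11 / (f01 + f10 + f11) if (f01 + f10 + f11) else 0.0
    return smc, jaccard

x = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
y = (0, 0, 0, 0, 0, 0, 1, 0, 0, 1)
print(smc_and_jaccard(x, y))   # (0.7, 0.0)
```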
Cosine Similarity
When we compare documents (say Doc A and Doc B), if we represent each document
as a vector of word counts, then usually many words in one document will not be
found in the other.
So there will be many 0–0 matches between the two documents, e.g.
Doc A = {1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0}
Doc B = {0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0}
If we counted all these 0–0 matches then we would wrongly conclude that the documents
are similar.
The Jaccard Coefficient that we saw in the last topic avoids this by not
considering 0–0 matches.
But the Jaccard Coefficient limits itself to binary vectors only.
Cosine similarity is one of the most common methods used to find
document similarity. If ‘x’ and ‘y’ are two document vectors then cosine
similarity is calculated as:
cos(x, y) = (x · y) / (||x|| ||y||)        ... (*)
where
x · y is the dot product of vectors x and y, i.e. x · y = Σ_{k=1..n} x_k y_k        ... (1)
||x|| is the length of vector x, i.e. ||x|| = √(Σ_{k=1..n} x_k²)        ... (2)
||y|| is the length of vector y, computed the same way.        ... (3)
Example: If x and y are two document vectors as shown below, find the document similarity.
x = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
y = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
Here x · y = 3·1 + 2·1 = 5, ||x|| = √(9 + 4 + 25 + 4) ≈ 6.48 and ||y|| = √(1 + 1 + 4) ≈ 2.45.
Substituting (1), (2), (3) in (*) we get
cos(x, y) = 5 / (6.48 × 2.45)
cos(x, y) ≈ 0.31
Cosine similarity is the cosine of the angle between x and y. So if the cosine similarity = 1
then the angle between x and y is 0°, because cos 0° = 1,
and if the similarity between x and y is 0 then the angle between x and y is 90°,
because cos 90° = 0.
The cosine equation is also written as:
cos(x, y) = (x / ||x||) · (y / ||y||)
i.e. the dot product of the two vectors after each has been normalized to unit length.
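A minimal sketch that reproduces the document-vector example using the cosine formula above.

```python
import numpy as np

def cosine_similarity(x, y):
    # cos(x, y) = (x . y) / (||x|| * ||y||)
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
y = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(cosine_similarity(x, y), 2))   # ≈ 0.31
```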
Extended Jaccard Coefficient (Tanimoto Coefficient)
We know that the Jaccard Coefficient applies only to binary
attributes. However, there is an Extended Jaccard Coefficient, also
called the Tanimoto Coefficient, that can be used for document
comparison:
EJ(x, y) = (x · y) / (||x||² + ||y||² − x · y)
Correlation
Correlation tries to find out how related two objects (rows) are, i.e. the linear
relationship between the two objects.
If x and y are two objects then we compute the correlation using the formula
corr(x, y) = covariance(x, y) / (std_deviation(x) × std_deviation(y)) = s_xy / (s_x × s_y)
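A sketch of both measures as reconstructed above; the two vectors reuse the cosine example purely for illustration, and correlation here is the usual Pearson correlation.

```python
import numpy as np

def tanimoto(x, y):
    # Extended Jaccard (Tanimoto): x.y / (||x||^2 + ||y||^2 - x.y)
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    dot = np.dot(x, y)
    return dot / (np.dot(x, x) + np.dot(y, y) - dot)

def correlation(x, y):
    # Pearson correlation: covariance divided by the product of the
    # standard deviations (np.corrcoef(x, y)[0, 1] gives the same value)
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.cov(x, y, ddof=1)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))

x = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
y = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(tanimoto(x, y), 2), round(correlation(x, y), 2))
```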
Bregman Divergence
Bregman Divergence is a loss function.
If y is the original vector (i.e. point) and x was obtained from y by distorting y
with some noise,
then our aim is to find the loss we have to incur if we decide to use x as an
approximation of y.
If x is very close to y then the loss will obviously be small.
To understand the Bregman Divergence we need some vector calculus.
Definition (Bregman Divergence):
If φ is a convex function then the divergence (loss function) D(x, y) generated by this
function is computed as
D(x, y) = φ(x) − φ(y) − ⟨∇φ(y), (x − y)⟩
where ∇φ(y) is the gradient of φ evaluated at y,
x − y is the vector difference between x and y, and
⟨∇φ(y), (x − y)⟩ is the inner product between ∇φ(y) and (x − y).
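As a concrete check of the definition: if we take φ(z) = ||z||² (a convex function), then ∇φ(y) = 2y and D(x, y) reduces to the squared Euclidean distance ||x − y||². The sketch below assumes exactly that choice of φ.

```python
import numpy as np

def bregman_squared_norm(x, y):
    # Bregman divergence for phi(z) = ||z||^2, where grad phi(y) = 2*y:
    # D(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    phi = lambda z: np.dot(z, z)
    return phi(x) - phi(y) - np.dot(2 * y, x - y)

x, y = np.array([2.0, 3.0]), np.array([1.0, 1.0])
print(bregman_squared_norm(x, y))   # 5.0
print(np.sum((x - y) ** 2))         # same value: squared Euclidean distance
```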
Issues in Proximity Calculation
Till now we saw how to compute the distance between two objects (i.e. rows) so that
we can evaluate their proximity. But there are 3 issues in computing distance this way:
(1) How to handle the cases in which attributes have different scales and/or the
attributes (i.e. columns) are correlated.
(2) How to find the distance between rows when some columns are qualitative
(e.g. red/blue/green eyes) and some columns are quantitative (e.g. salary).
(3) Sometimes we can have attributes such that some are more important than others.
Standardization and Correlation for Distance Measures
This topic says that if we have attributes (i.e. columns) that are correlated to each other,
then the distance should not be computed using the Euclidean distance; instead we should use
the Mahalanobis distance:
mahalanobis(x, y) = (x − y) Σ⁻¹ (x − y)ᵀ
where Σ⁻¹ is the inverse of the covariance matrix of the data.
Example 2.23: The two big points in Figure 2.19 have a large Euclidean distance of
14.7, but these points are closely related. Why?
Because as x increases y also increases, so they are related.
So, when we calculate the Mahalanobis distance between the two large dots we get 6,
which means they are (relatively) close.
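A sketch of the Mahalanobis distance as written above; the small 2-D dataset is hypothetical (x and y grow together) and is only there to supply a covariance matrix Σ.

```python
import numpy as np

def mahalanobis(x, y, data):
    # mahalanobis(x, y) = (x - y) * inverse(Sigma) * (x - y)^T,
    # where Sigma is the covariance matrix estimated from 'data'
    cov = np.cov(np.asarray(data, dtype=float), rowvar=False)
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(diff @ np.linalg.inv(cov) @ diff)

data = [[1, 1.0], [2, 2.4], [3, 2.5], [4, 4.5], [5, 4.8]]   # correlated columns
a, b = [1, 1.0], [5, 4.8]
print(mahalanobis(a, b, data))                      # accounts for the correlation
print(np.linalg.norm(np.array(a) - np.array(b)))    # plain Euclidean, for contrast
```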
Combining Similarities for Heterogeneous Attributes
Now let us look at point (2) given above: what do we do when the dataset has
attributes of different types?
Algorithm 2.1 tells how to compute the overall similarity.
Basically the algorithm says the following (a small sketch in code is given after this list):
First compute the similarity value for each attribute; for the kth attribute we call it
Sk(x, y).
Next, define an indicator δk for the kth attribute. δk will be zero (0) if
the kth attribute is asymmetric and both x and y have ‘0’ in that column,
or we are missing a value in this cell;
otherwise δk = 1.
Finally, the overall similarity is
similarity(x, y) = Σ δk Sk(x, y) / Σ δk
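A minimal sketch of the combination step; it assumes the per-attribute similarities Sk(x, y) and the δk flags have already been computed (e.g. with the per-attribute rules from the earlier sections), so the lists below are illustrative.

```python
def combined_similarity(sims, deltas):
    # Overall similarity(x, y) = sum(delta_k * S_k) / sum(delta_k)
    # sims   : per-attribute similarities S_k(x, y)
    # deltas : 0 if the attribute is skipped (asymmetric with two zeros,
    #          or a missing value), 1 otherwise
    total_weight = sum(deltas)
    if total_weight == 0:
        return 0.0   # no usable attribute; this fallback is an assumption
    return sum(d * s for d, s in zip(deltas, sims)) / total_weight

# Example: 4 attributes, the 3rd one is skipped (delta = 0)
print(combined_similarity([1.0, 0.75, 0.0, 0.4], [1, 1, 0, 1]))   # ≈ 0.72
```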