Module 2 DMDW
(21IS643)
Topics covered:
• Bitmap Indexing
• Join Indexing
• OLAP Query Processing
OLAP Server Architecture: ROLAP vs. MOLAP vs. HOLAP
• OLAP servers present the multidimensional data to the user (in the form of
charts, graphs, tables, etc.).
• But how does this data get stored inside the warehouse?
• There are three ways to implement the storage of the warehouse server.
✓ Relational OLAP (ROLAP) Servers
▪ These servers sit between the relational back-end server (i.e., say, the
operational database containing e-commerce transactions) and the front-end
tools (like charts, etc.).
▪ They are responsible for storing the warehouse data. They use a
relational DBMS or an extended relational DBMS to store this
warehouse data.
▪ They also contain middleware to support OLAP operations on the relational data.
▪ ROLAP servers are more scalable than MOLAP servers.
➢ Each row in these relational tables is NOT a user transaction; each row
is actually warehouse data from one of the cuboids.
✓ Multidimensional OLAP (MOLAP) Servers
▪ These servers store the warehouse data directly in multidimensional
array-based structures (sparse arrays) instead of relational tables.
✓ Hybrid OLAP (HOLAP) Servers
▪ These combine both approaches: detailed data is kept in a relational
store (ROLAP style) while aggregations are kept in a multidimensional
store (MOLAP style).
Data Mining Tasks
Predictive Modelling (the first type of data mining task), example:
➢ The IRIS dataset is a data set containing petal length & petal width
for 3 iris flower species: setosa, versicolor, and virginica.
➢ Two more features: sepal length & sepal width.
Association Analysis (the second type of data mining task), example:
*Note: Whoever buys diapers also buys milk: {Diaper} -> {Milk}
Cluster Analysis
• This is the third type of data mining task; it tries to find
groups of related observations.
• i.e., all related items are clustered together.
Ex: Document Clustering
A good clustering algorithm will find the two clusters:
Cluster 1: documents related to Economy
Cluster 2: documents related to Health
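Not from the notes, but to make the idea concrete: a minimal document-clustering sketch in Python, assuming scikit-learn is available. The toy sentences, TF-IDF features, and k-means are all illustrative choices, not the algorithm the notes prescribe:

# Toy document clustering: TF-IDF features + k-means (invented sentences).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "stock markets and the economy",
    "inflation slows the economy and stock prices",
    "exercise improves health",
    "a balanced diet improves health",
]
X = TfidfVectorizer().fit_transform(docs)            # documents -> vectors
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # expected: the economy docs share one label, the health docs the other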
Anomaly Detection
• This is the 4th type of data mining task; it tries to identify
observations that are significantly different from the rest of the
data.
• The aim is to identify anomalies while avoiding falsely detecting/labelling
normal objects as anomalies.
• A good anomaly detection algorithm has high detection rate &
low false alarm rate.
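As a small worked sketch (the confusion counts below are made up), the two rates can be computed like this in Python:

# Detection rate = TP / (TP + FN): fraction of real anomalies we caught.
# False alarm rate = FP / (FP + TN): fraction of normal objects wrongly flagged.
TP, FN = 45, 5     # anomalies: detected vs missed (made-up counts)
FP, TN = 20, 930   # normal objects: wrongly flagged vs correctly passed
print("detection rate   =", TP / (TP + FN))             # 0.9   -> higher is better
print("false alarm rate =", round(FP / (FP + TN), 3))   # 0.021 -> lower is better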
Attributes:
Each attribute captures one basic characteristic of an object (each column of
the dataset is called an attribute).
It is also called a variable, characteristic, field, feature, or dimension.
Attributes and Measurement
What is an Attribute?
An attribute is a property or characteristic of an object that may
vary, either from one object to another or from one time to another.
Example: eye color is an attribute of a person object. Eye color can
vary from person to person. So, if we have a "person" dataset
with each row having the information of 1 person, then eye color will be
one attribute of a person (i.e., one column of the dataset).
Attributes where only non-zero values are important are called asymmetric
attributes.
In the case of such attributes we should focus only on the non-zero values and
ignore the zero values.
Ex: A record of a student will have '1' for a course that the student has taken and
'0' for a course that he or she has not taken. If you concentrate on '0' then we
will be looking at a lot of attributes with value '0', as students take only a few
courses. In such cases we should concentrate on attributes that have '1' for the
course.
Types of Data sets:-
● Record data
● Ordered data
Before the types, three general characteristics of data sets are important:
a) Dimensionality
b) Sparsity
c) Resolution
a) Dimensionality
● Dimensionality means number of attributes i.e. number of columns in a data
set.
● Higher dimensionality, that is, more columns, is not preferred. So we generally
apply dimensionality reduction techniques to reduce the dimensions (i.e., columns).
b) Sparsity
● Sparsity means that most attribute values are zero, so we can concentrate
only on the non-zero values (see the sketch after this list).
● By storing sparse data in a form that keeps only the non-zero values, we save
space and computation time.
c) Resolution
● It is important to decide at what resolution the database should be stored.
● Ex:- The surface of the Earth will look uneven if we look at it at metre-level
resolution, but at km-level resolution the Earth looks smooth.
● Ex:- If we look at the weather pattern at the granularity of 'hours' then we get
details on storms etc., but if we look at the granularity of 'months' we might miss
them.
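The sketch promised under point (b): a minimal illustration of sparse storage, assuming numpy and scipy are available. The student-course matrix is invented:

# A sparse 0/1 student-course matrix: keep only the non-zero cells.
import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([
    [1, 0, 0, 1, 0, 0],   # student 1 took courses 0 and 3
    [0, 0, 1, 0, 0, 0],   # student 2 took course 2
    [0, 1, 0, 0, 0, 1],   # student 3 took courses 1 and 5
])
sparse = csr_matrix(dense)                              # stores only the 1s
print(sparse.nnz, "non-zero cells out of", dense.size)  # 5 out of 18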
Dataset Type1
➢ If every row of the data set needs all the attributes, and all attributes
take real/integer values, then we call such data a Data
Matrix.
➢ Such a data matrix will be an m x n matrix with 'm' rows, one for each
object, and 'n' columns, one for each attribute.
➢ An example is the Iris flower data set, where each row is for one
sample flower with four columns (sepal length, sepal width, petal length,
petal width).
Such a data set will be an m x 4 matrix, i.e., 'm' rows for m different
flower samples and each row with 4 columns.
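A quick check of this shape, assuming scikit-learn (which bundles the Iris data) is available:

# The Iris data as an m x n data matrix.
from sklearn.datasets import load_iris

X = load_iris().data
print(X.shape)   # (150, 4): m = 150 flower samples, n = 4 attributes
print(X[0])      # the four measurements of the first flower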
The Sparse Data Matrix
A sparse data matrix is a special case of a data matrix in which most entries
are zero and only the non-zero values matter (e.g., a document-term matrix,
where each document row stores counts for only the few terms it contains).
Data Quality
The data that we collect for data mining can have issues with the quality of the data.
We will look at some aspects of data quality here.
Measurement & Data Collection Issues
We can have issues while
✓ measuring the values
✓ collecting the data
Measurement Errors
Any problems arising from the measurement process
✓ i.e., the value recorded differs from the true value.
✓ The difference between the recorded value & the true value is the measurement error.
Data Collection Errors
✓ Errors introduced during the data collection process.
✓ Ex: the person recording the data missed typing a few values for a few rows in the
dataset. So either the objects (rows) are missing/duplicated, or the feature
(column) values are missing.
a) Noise and Artifacts
When we are measuring the values, a random component may get added
to the measurement and hence distort the values.
The figure shows how time series data gets distorted when noise is
introduced.
Similarly, the next figure shows noise points (i.e., '+') that got added to spatial
data. Generally noise gets added to temporal data (i.e., time series data like a
sound wave) or spatial data (like temperature at different places).
We can use many signal processing and image processing techniques to
reduce the noise.
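One hypothetical example of such a technique, assuming numpy: smoothing a noisy sine wave with a simple moving average. The window size of 9 is an arbitrary illustrative choice:

# Smooth a noisy sine wave with a simple moving average.
import numpy as np

t = np.linspace(0, 2 * np.pi, 200)
clean = np.sin(t)
noisy = clean + np.random.default_rng(0).normal(0, 0.3, t.size)

window = 9
smooth = np.convolve(noisy, np.ones(window) / window, mode="same")
print("mean error before:", round(float(np.abs(noisy - clean).mean()), 3))
print("mean error after :", round(float(np.abs(smooth - clean).mean()), 3))  # smaller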
Quantifying the Error: Precision, Bias, Accuracy
✓ Precision: the closeness of repeated measurements (of the same quantity) to
one another.
✓ Bias: a systematic variation of the measurements from the quantity being
measured.
✓ Accuracy: the closeness of measurements to the true value of the quantity
being measured.
Sampling Approach
1 Random sampling
sampling without replacement
sampling with replacement
2 Stratified sampling
equal objects drawn
proportional objects drawn
3 Progressive sampling
Random Sampling
✓ Sampling without replacement ensures that the same row does not get
picked again when we draw the next sample.
✓ Sampling with replacement means there is a chance that the same row may get
picked twice during the sampling process.
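A small sketch of both variants using numpy's random sampling (the 10 row indices are made up):

# Random sampling with and without replacement.
import numpy as np

rng = np.random.default_rng(0)
rows = np.arange(10)                            # pretend these are 10 row indices

print(rng.choice(rows, size=5, replace=False))  # no row can repeat
print(rng.choice(rows, size=5, replace=True))   # the same row may repeat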
Stratified Sampling
✓ The data is first split into groups (strata), and objects are then drawn from
each group: either an equal number from every group, or a number
proportional to the size of each group (as listed above).
Dimensionality Reduction
(a) Many data mining algorithms work best when the number of
columns (dimensions) is small.
• In this step we try to remove features (i.e., columns) that are not helpful. There
are two ad hoc and three systematic approaches for feature selection.
Suppose you are shown a circle. But what if I say that this is actually a cone
which you are viewing from the top?
A cone's top view looks like a circle: a lower-dimensional view of the data can
look very different from the full data.
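A hypothetical sketch of the same idea, assuming numpy and scikit-learn: PCA (one common dimensionality reduction technique, not necessarily the one the notes intend) projects made-up 3-D cone points down to a 2-D view:

# Project made-up 3-D "cone" points down to a 2-D view with PCA.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
z = rng.uniform(0, 1, 300)                 # height along the cone
theta = rng.uniform(0, 2 * np.pi, 300)     # angle around the axis
r = 1.0 - z                                # radius shrinks toward the tip
points = np.column_stack([r * np.cos(theta), r * np.sin(theta), z])

flat = PCA(n_components=2).fit_transform(points)
print(points.shape, "->", flat.shape)      # (300, 3) -> (300, 2)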
Fourier Transform
1. Generally an electromagnetic signal in the time
domain looks like, say, a sine wave.
2. When we analyse the signal in the time domain we
may not notice any issues or anomalies.
3. However, when we use the Fourier transform and convert
the time-domain signal to a frequency-domain signal, we
might observe noise in the signal.
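A rough illustration of point 3, assuming numpy: a 50 Hz sine wave with a weak 300 Hz component hidden in it; the frequency-domain view exposes the hidden component. All signal parameters are invented:

# A weak 300 Hz component hides inside a 50 Hz sine wave; the FFT reveals it.
import numpy as np

fs = 1000                                   # sampling rate in Hz (invented)
t = np.arange(0, 1, 1 / fs)
signal = np.sin(2 * np.pi * 50 * t) + 0.05 * np.sin(2 * np.pi * 300 * t)

spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(t.size, 1 / fs)
print(freqs[spectrum > 10])                 # [ 50. 300.] -> both components show up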
Feature Extraction
● Example:
We have a dataset of historically excavated items. For each item we
have its mass and volume. But we want to find the material from
which the item was made.
● One way to do it is to calculate mass / volume to get the density. The
value of the density gives us an idea of the material used to make the item.
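A minimal pandas sketch of this derived-feature idea; the masses and volumes are made-up values chosen to land near the densities of gold, copper, and silver:

# Derive a density feature from mass and volume (made-up items).
import pandas as pd

items = pd.DataFrame({
    "mass_g":     [386.0, 269.0, 52.5],
    "volume_cm3": [20.0, 30.0, 5.0],
})
items["density"] = items["mass_g"] / items["volume_cm3"]
print(items)   # ~19.3, ~9.0, ~10.5: close to gold, copper, and silver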
Discretization and Binarization
● (Context: suppose the five categorical values awful, poor, OK, good, great are
first mapped to the integers 0 to 4, and each integer is then written as three
binary digits x1 x2 x3: awful = 000, poor = 001, OK = 010, good = 011,
great = 100.)
● Now these x1, x2, x3 could have been the columns that
replaced the categorical column inside our data set.
● But there is a problem: note that for both "good" and "OK" the
column "x2" has 1, so we accidentally added a correlation between two
columns / attributes.
● So even though this kind of helps, it is not the final solution.
Step 3: Instead, convert each integer value to asymmetric
binary attributes.
These 5 columns (one per categorical value) can now be added to the dataset, and
we can remove the categorical-value column from the dataset. In essence we
binarized the categorical attribute.
Example 2: A binary attribute can also be replaced with two asymmetric
binary attributes.
For example, Gender (male & female) is replaced as below.
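A small sketch using pandas one-hot encoding (an assumed tool choice) to turn one categorical column into one 0/1 column per value:

# One-hot encode a categorical column: one asymmetric 0/1 column per value.
import pandas as pd

df = pd.DataFrame({"quality": ["awful", "poor", "OK", "good", "great"]})
print(pd.get_dummies(df["quality"]).astype(int))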
Discretization of Continuous Attributes
Step 1: Sort the values and divide them into 'n' intervals by specifying
n-1 split points.
Ex: If we have values like 5.4, 12.9, 6.8, 25.4, 16.5 then we specify 3
intervals with two split points, i.e., 10 and 20. In this scenario 5.4 & 6.8 will
be to the left of 10, whereas 12.9 & 16.5 will be between 10 & 20, and
finally 25.4 will be to the right of 20.
Step 2: Round off these values to a pre-agreed value within the
interval.
Ex: 5.4 & 6.8 may get rounded to, say, '5'; 12.9 & 16.5 may get rounded
to, say, 15.
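A sketch of Steps 1-2 with numpy, using the split points 10 and 20 and the representative values 5, 15, 25 from the example above:

# Steps 1-2: bin values with split points 10 and 20, then round each value
# to its interval's agreed representative.
import numpy as np

values = np.array([5.4, 12.9, 6.8, 25.4, 16.5])
bins = np.digitize(values, [10, 20])        # 0, 1 or 2: which interval
reps = np.array([5, 15, 25])                # one representative per interval
print(bins)        # [0 1 0 2 1]
print(reps[bins])  # [ 5 15  5 25 15]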
✓ Equal Frequency: In this approach we draw the split lines such that
each interval has the same number of points. In Figure 2.13(c) we draw
the lines such that if the (0 to 5.5) interval has 120 points, then the next
group, say (5.5 to 9), will also have 120 points, and the next group (9 to
14.8) will also have 120 points, and so on.
✓ Unsupervised discretization does not take the class labels into consideration
when it performs discretization.
✓ Supervised discretization takes the class information during discretization. It
works as follows:
➢ Let k be the number of different class labels (cat, dog, cow, ...).
➢ Let mi be the number of values present in the ith interval of a partition.
➢ Let mij be the number of values of class j present in the ith interval.
➢ Then the entropy ei of the ith interval is given by the equation:
ei = - sum over j of (pij * log2(pij)), where pij = mij / mi is the fraction of
values in the ith interval that belong to class j.
By convention, if pij = 0, i.e., there is not even one value of class j in the ith
interval, then the term pij * log2(pij) is taken as 0.
So if an interval 'i' contains samples of only one class, then its entropy ei = 0.
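A minimal Python sketch of the interval entropy formula (the class counts are made up):

# Entropy of one interval: e_i = -sum_j p_ij * log2(p_ij), p_ij = m_ij / m_i.
import numpy as np

def interval_entropy(class_counts):
    p = np.array(class_counts) / sum(class_counts)
    p = p[p > 0]                     # convention: 0 * log2(0) = 0
    return float(-(p * np.log2(p)).sum())

print(interval_entropy([10, 0, 0]))           # 0.0 -> interval is pure
print(interval_entropy([5, 5]))               # 1.0 -> maximally mixed
print(round(interval_entropy([8, 1, 1]), 2))  # 0.92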
Categorical Attributes with Too Many Values
If a categorical attribute has too many distinct values, a common remedy is to
combine the values into larger groups.
Variable Transformation: Standardization
Name    Age    Salary
John    27     1,20,000
Rita    31     1,65,000
Nancy   33     1,90,000
After standardization, the values of age and salary are in the same range, so both
columns will be treated with equal priority.
Note: Our data can have outliers that might skew the mean (x̄) and standard
deviation (sx). To avoid this, sometimes instead of the mean we use the median (i.e.,
the middle value), and the standard deviation is replaced by the absolute standard
deviation sigma_A = (1/m) * sum over i of |xi - mu|, where mu is the median.
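A small numpy sketch of both options, using the salary column from the table above: the usual (x - mean) / std standardization and the median-based variant:

# Standardize the salary column: z-score, and a median-based robust variant.
import numpy as np

salary = np.array([120000.0, 165000.0, 190000.0])

z = (salary - salary.mean()) / salary.std()
mu = np.median(salary)
sigma_a = np.mean(np.abs(salary - mu))       # absolute standard deviation
robust = (salary - mu) / sigma_a

print(z.round(2))       # roughly [-1.32  0.23  1.09]
print(robust.round(2))  # roughly [-1.93  0.    1.07]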
Measures of Similarity and Dissimilarity
Basics:
(a) Definitions:
✓ Similarity: Similarity between two objects (i.e rows) is a numerical measure
of the degree to which the two objects are alike.
✓ Similarity often lies between 0 (not similar) and 1 (completely similar).
✓ Dissimilarity: Dissimilarity between two objects is a numerical measure of
the degree to which the two objects are different.
✓ If two objects are alike, the dissimilarity is low.
✓ We use distance as a synonym for dissimilarity.
(b) Transformation:
We generally apply transformation to do the following:
✓ Convert a similarity value to a dissimilarity, or vice versa
✓ Convert the values from one range (say(0 to 10)) to another range (say (0 to
1)).
✓ Example of the need for transformation: Assume that we calculated proximity
and assigned values in the range 1 to 10. Now we use a software package
from some other company that uses the range [0, 1]; then we will have to
transform the values from (1 to 10) to [0, 1].
We can use the formula
s' = (s - 1) / 9,
✓ where s is the similarity value between 1 and 10 and s' is the new value
between 0 and 1.
In general the formula for the new similarities is:
s' = (s - min_s) / (max_s - min_s)
Similarly, dissimilarity values can be converted from any range to the new
range [0, 1] using the formula
d' = (d - min_d) / (max_d - min_d)
Exception: What if the original values are between (0 to ∞). In this case we can
use the below formula:
d’ = d/(1 + d)
✓ Using this formula original values 0, 0.5, 2, 10, 100, and 1000 get converted to
0, 0.33, 0.67, 0.90, 0.99, and 0.999.
✓ So large values get closer to 1.
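The same mapping can be checked in a couple of lines of Python:

d = [0, 0.5, 2, 10, 100, 1000]
print([round(v / (1 + v), 3) for v in d])
# [0.0, 0.333, 0.667, 0.909, 0.99, 0.999]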
Converting dissimilarity to similarity:
A dissimilarity can be converted to a similarity with a decreasing
transformation, e.g., s = 1 - d (for d in [0, 1]) or s = 1 / (1 + d).
Similarity Measures for Binary Data
Simple Matching Coefficient (SMC):
SMC = (number of matching attribute values) / (number of attributes)
    = (f11 + f00) / (f00 + f01 + f10 + f11)
where f11 = number of attributes where both x and y are 1, f00 = number where
both are 0, and f01, f10 = number where the two values differ.
The formula simply says: count the number of columns where the value
matches, then divide this value by the total number of columns in the dataset.
Jaccard Coefficient (J):
J = f11 / (f01 + f10 + f11)
The Jaccard coefficient ignores the 0-0 matches, so it is suited to asymmetric
binary attributes.
Example: If x and y are two document vectors as shown below, find the document similarity.
x = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
y = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
cos(x, y) = (x . y) / (||x|| * ||y||)
x . y = 3*1 + 2*0 + 5*0 + 2*1 = 5
||x|| = sqrt(3^2 + 2^2 + 5^2 + 2^2) = sqrt(42) = 6.48
||y|| = sqrt(1^2 + 1^2 + 2^2) = sqrt(6) = 2.45
cos(x, y) = 5 / (6.48 * 2.45)
cos(x, y) = 0.31
Cosine similarity is the cosine of the angle between x and y. So if the cosine
similarity = 1, then the angle between x and y is 0° because cos 0° = 1,
and if the similarity between x and y is 0, then the angle between x and y is 90°
because cos 90° = 0.
The cosine equation is also written as:
cos(x, y) = (x / ||x||) . (y / ||y||)
i.e., the dot product of the two length-normalized vectors.
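A quick numpy verification of the worked example above:

# Verify the worked cosine-similarity example.
import numpy as np

x = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])
y = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])

dot = x @ y
nx, ny = np.linalg.norm(x), np.linalg.norm(y)
print(dot, round(float(nx), 2), round(float(ny), 2))  # 5 6.48 2.45
print(round(float(dot / (nx * ny)), 2))               # 0.31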
Extended Jaccard Coefficient (Tanimoto Coefficient)
We know that the Jaccard coefficient is applied only to binary
attributes. However, there is an Extended Jaccard Coefficient, also
called the Tanimoto Coefficient, that can be used for document
comparison:
EJ(x, y) = (x . y) / (||x||^2 + ||y||^2 - x . y)
Correlation
Correlation tries to find out how related two objects (rows) are, i.e., the linear
relation between the two objects.
If x and y are two objects then we compute the correlation using the formula
corr(x, y) = covariance(x, y) / (std_dev(x) * std_dev(y)) = s_xy / (s_x * s_y)
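A small numpy check; the two example objects are chosen so that y = -x/3, giving a perfect negative linear relation:

# Correlation of two objects (rows).
import numpy as np

x = np.array([-3, 6, 0, 3, -6])
y = np.array([1, -2, 0, -1, 2])
print(np.corrcoef(x, y)[0, 1])   # -1.0 -> perfect negative linear relation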
Bregman Divergence
Bregman divergences are a family of proximity (loss) functions; squared
Euclidean distance is one example.
Issues in Proximity Calculation
Till now we saw how to compute the distance between two objects (i.e., rows) so that
we can evaluate their proximity. But there are 3 issues in computing distance the
way we saw:
(1) How to handle the cases in which attributes have different scales and/or the
attributes (i.e., columns) are correlated.
(2) How to find the distance between rows if some columns are qualitative
(red, blue, green eyes) and some columns are quantitative (salary).
(3) Sometimes we can have attributes such that some are more important than
others.
Standardization and Correlation for Distance Measures
This topic says that if we have attributes (i.e., columns) that are correlated to each
other, then distance should not be computed using Euclidean distance; instead we
should use the Mahalanobis distance:
mahalanobis(x, y) = (x - y) * Sigma^-1 * (x - y)^T
where Sigma^-1 is the inverse of the covariance matrix of the data.
Example 2.23: The two big points in Figure 2.19 have a large Euclidean distance of
14.7, but these points are closely related. Why?
Because as x increases, y also increases, so the two points lie along the trend of
the data.
So, when we calculate the Mahalanobis distance between the two large dots we get
6, which means they are close.
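A hypothetical numpy sketch of the same effect on made-up, strongly correlated 2-D data (the exact numbers differ from Figure 2.19, which we do not have):

# Correlated 2-D data: two far-apart points along the trend are "close"
# by Mahalanobis distance even though their Euclidean distance is large.
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(0, 5, 500)
data = np.column_stack([a, 0.9 * a + rng.normal(0, 1, 500)])  # y grows with x

inv_cov = np.linalg.inv(np.cov(data, rowvar=False))
p, q = np.array([-5.0, -4.5]), np.array([5.0, 4.5])           # along the trend
d = p - q
print(round(float(np.linalg.norm(d)), 1))   # Euclidean distance: about 13.5
print(round(float(d @ inv_cov @ d), 1))     # Mahalanobis (squared form above): about 4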
Combining Similarities for Heterogeneous Attributes
Now let us look at point (2) given above. What do we do when the dataset has
attributes of different types?
Algorithm 2.1 tells how to compute the similarity.
Basically the algorithm says the following:
First compute the similarity value for every attribute; for the kth attribute we
call it Sk(x, y).
Define an indicator δk for each attribute. δk will be zero (0) if
the kth attribute is asymmetric and both x and y have '0' in that column,
or we are missing a value in this cell;
otherwise δk = 1.
Finally combine the per-attribute similarities:
similarity(x, y) = (sum over k of δk * Sk(x, y)) / (sum over k of δk)
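A minimal sketch of this combination step in Python (the per-attribute similarities and δ values are made up):

# Algorithm 2.1 sketch: similarity(x, y) = sum(delta_k * s_k) / sum(delta_k).
import numpy as np

def combined_similarity(s, delta):
    s, delta = np.asarray(s, float), np.asarray(delta, float)
    return float((delta * s).sum() / delta.sum())

s = [1.0, 0.4, 0.0, 0.7]     # per-attribute similarities s_k (made up)
delta = [1, 1, 0, 1]         # attribute 3 dropped: asymmetric, both values 0
print(round(combined_similarity(s, delta), 2))   # (1.0 + 0.4 + 0.7) / 3 = 0.7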