DM DW Module 2
A 2-D (City, Item) cuboid of the Sales data:

City       I1  I2  I3  I4  I5  I6  All
New York   10  11  12   3  10   1   47
Chicago    11   9   6   9   6   7   48
Toronto    12   9   8   5   7   3   44
Vancouver  13   8  10   5   6   3   45
All        46  37  36  22  29  14  184
An n-dimensional cube with no concept hierarchies has 2^n cuboids. When each dimension i has L_i levels in its concept hierarchy, the total number of cuboids is

T = \prod_{i=1}^{n} (L_i + 1)

For example, with n = 10 dimensions and L_i = 4 hierarchy levels per dimension, T = 5^10, or roughly 9.8 million cuboids.
If the attribute has the value v for a given row in the data table,
then the bit representing that value is set to 1 in the
corresponding row of the bitmap index. All other bits for that
row are set to 0.
Bitmap indexing is efficient compared to hash and tree indices, especially for attributes with few distinct values.
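As a minimal sketch of the idea in plain Python (the column values are illustrative):

    # One bit vector per distinct attribute value: row r has bit 1 in the
    # vector of the value it holds, and 0 in all other vectors.
    city = ["Chicago", "Toronto", "Chicago", "Vancouver", "Toronto"]

    bitmap = {v: [1 if x == v else 0 for x in city] for v in set(city)}
    # bitmap["Toronto"] -> [0, 1, 0, 0, 1]

    # Selection becomes a bit-vector scan; combining predicates is a bitwise AND/OR.
    rows_toronto = [r for r, bit in enumerate(bitmap["Toronto"]) if bit]
    print(rows_toronto)  # [1, 4]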
E.g., a Sales fact table with two dimensions, City and Product.
ROLAP tools can answer any question, because the methodology is not limited to the contents of a predefined cube.
ROLAP also has the ability to drill down to the lowest level of
detail in the database.
MOLAP stores data in optimized multidimensional array storage.
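The contrast with ROLAP's relational tables can be sketched with the City × Item cuboid above, using a NumPy array as a stand-in for MOLAP's multidimensional storage (illustrative only):

    import numpy as np

    # Axis 0 = city (New York, Chicago, Toronto, Vancouver), axis 1 = item I1..I6.
    cube = np.array([
        [10, 11, 12, 3, 10, 1],
        [11,  9,  6, 9,  6, 7],
        [12,  9,  8, 5,  7, 3],
        [13,  8, 10, 5,  6, 3],
    ])

    # Roll-up to "All" is a sum along an axis of the array.
    print(cube.sum(axis=1))  # per-city totals: [47 48 44 45]
    print(cube.sum(axis=0))  # per-item totals: [46 37 36 22 29 14]
    print(cube.sum())        # grand total: 184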
Data preprocessing includes:
◦ fusing data from multiple sources
◦ cleaning data to remove noise and duplicate observations
◦ selecting records and features that are relevant to the data mining task at hand
Postprocessing ensures that only valid and useful results are incorporated into the DSS.
High Dimensionality
Non-traditional Analysis
Predictive tasks: predict the value of a particular attribute based on the values of other attributes.
◦ The attribute to be predicted is the target or dependent variable.
◦ The attributes used for prediction are the explanatory or independent variables.
Predictive modeling refers to the task of building a model for the target variable as a function of the explanatory variables.
Training set for predicting credit worthiness:

Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Graduate             5                          Yes
2    Yes       High School          2                          No
3    No        Undergrad            1                          No
4    Yes       High School         10                          Yes
...  ...       ...                 ...                         ...

[Figure: a decision tree fit to this training set, with splits on Employed (Yes/No), Level of Education ({High school, Undergrad, Graduate}), and Number of years at present address.]
Given a collection of records (the training set), the classification task is to learn a model that assigns a class label to each record.
[Figure: the training set is fed to a learning algorithm to produce a model (classifier); the model is then applied to a test set.]
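A minimal learn-then-apply sketch on the credit-worthiness table above, assuming scikit-learn is available (the numeric encoding of the categories is illustrative):

    from sklearn.tree import DecisionTreeClassifier

    # Employed: No=0/Yes=1; Education: High School=0, Undergrad=1, Graduate=2.
    X_train = [[1, 2, 5], [1, 0, 2], [0, 1, 1], [1, 0, 10]]
    y_train = ["Yes", "No", "No", "Yes"]          # Credit Worthy

    model = DecisionTreeClassifier().fit(X_train, y_train)  # learn the model

    X_test = [[1, 1, 7]]                          # apply it to an unseen record
    print(model.predict(X_test))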
Cluster Analysis
Finding groups of objects such that the objects
in a group will be similar (or related) to one
another and different from (or unrelated to) the
objects in other groups
Intra-cluster distances are minimized, while inter-cluster distances are maximized.
Clustering has been used to group sets of related customers, find areas of the ocean that have a significant impact on the Earth's climate, and compress data.
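A minimal clustering sketch using k-means (one common algorithm; the text does not prescribe a specific method), assuming scikit-learn:

    from sklearn.cluster import KMeans

    points = [[1.0, 1.2], [0.8, 1.1], [1.1, 0.9],   # one tight group
              [8.0, 8.2], [7.9, 8.1], [8.2, 7.8]]   # another, far away

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
    print(km.labels_)           # same label within a group, different across groups
    print(km.cluster_centers_)  # intra-cluster distances small, inter-cluster large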
Given a set of records, each of which contains some number of items from a given collection, association rule discovery produces dependency rules that predict the occurrence of an item based on the occurrences of other items.
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
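The discovered rules can be checked against the transactions above by computing support and confidence directly (a brute-force sketch, not an efficient algorithm such as Apriori):

    transactions = [
        {"Bread", "Coke", "Milk"},
        {"Beer", "Bread"},
        {"Beer", "Coke", "Diaper", "Milk"},
        {"Beer", "Bread", "Diaper", "Milk"},
        {"Coke", "Diaper", "Milk"},
    ]

    def support(itemset):
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(lhs, rhs):
        return support(lhs | rhs) / support(lhs)

    print(confidence({"Milk"}, {"Coke"}))            # 0.75
    print(confidence({"Diaper", "Milk"}, {"Beer"}))  # 0.667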
An attribute is a property or characteristic of an object.
◦ Attribute is also known as variable, field, characteristic, dimension, or feature.
A collection of attributes describes an object.
◦ Object is also known as record, point, case, sample, entity, or instance.
[Figure: the lengths of four objects (A–D) measured on two different scales; the left scale preserves only the ordering property of length, while the right scale preserves both the ordering and additivity properties of length.]
The following properties (operations) of numbers are typically used to describe attributes:
◦ Distinctness: = and ≠
◦ Order: < and >
◦ Differences are meaningful: + and −
◦ Ratios are meaningful: × and ÷
Continuous Attribute
◦ Has real numbers as attribute values
◦ Examples: temperature, height, or weight
◦ Practically, real values can only be measured and
represented using a finite number of digits.
◦ Continuous attributes are typically represented as
floating-point variables.
Only presence (a non-zero attribute value) is
regarded as important.
◦ Sparsity
only the non-zero values need to be stored and manipulated
which improves computation time and storage.
◦ Resolution
data can be obtained at different levels of resolution, and the
properties of the data are often different at different resolutions.
If the resolution is too fine, a pattern may not be visible; if it is
too coarse, the pattern may disappear.
Record
◦ Data Matrix
◦ Document Data
◦ Transaction Data
Graph
◦ World Wide Web
◦ Molecular Structures
Ordered
◦ Spatial Data
◦ Temporal Data
◦ Sequential Data
◦ Genetic Sequence Data
Data that consists of a collection of records, each
of which consists of a fixed set of attributes
[Table: record data with attributes Tid, Refund, Marital Status, Taxable Income, and Cheat.]
Each document becomes a term vector: each term is an attribute, and the value of each component is the number of times the corresponding term occurs in the document.

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0
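A term-count matrix like this can be built with only the standard library (the documents here are illustrative):

    from collections import Counter

    docs = ["team play team game game score",
            "coach win ball win"]
    vocab = sorted(set(" ".join(docs).split()))

    matrix = [[Counter(d.split())[t] for t in vocab] for d in docs]
    print(vocab)
    print(matrix)  # one row per document, one column per term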
A special type of record data, where
◦ Each record (transaction) involves a set of items.
◦ For example, consider a grocery store. The set of
products purchased by a customer during one
shopping trip constitute a transaction, while the
individual products that were purchased are the
items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
Relationships among objects frequently convey important information. In such cases, the data is often represented as a graph.
Spatio-Temporal Data
Record-oriented techniques can be applied to
non-record data by extracting features from
data objects and using these features to
create a record corresponding to each object.
Examples:
◦ Same person with multiple email addresses
◦ Two distinct persons with the same name
Timeliness
Relevance
Purpose:
– Data reduction
Reduce the number of attributes or objects
– Change of scale
Cities aggregated into regions, states, countries, etc.
Days aggregated into weeks, months, or years
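As a sketch of change of scale, assuming pandas (column names and dates are illustrative):

    import pandas as pd

    daily = pd.DataFrame(
        {"sales": [5, 7, 6, 9]},
        index=pd.to_datetime(["2024-01-10", "2024-01-25",
                              "2024-02-03", "2024-02-20"]),
    )
    # Days rolled up into months: fewer objects, coarser scale.
    monthly = daily.groupby(daily.index.to_period("M")).sum()
    print(monthly)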
Purpose:
◦ Avoid curse of dimensionality
◦ Reduce amount of time and memory required by data
mining algorithms
◦ Allow data to be more easily visualized
◦ May help to eliminate irrelevant features or reduce noise
The curse of dimensionality refers to the
phenomenon that many types of data analysis
become significantly harder as the dimensionality of
data increases.
Techniques
Principal Components Analysis (PCA)
Singular Value Decomposition (SVD) is a linear algebra technique that is related to PCA and is also commonly used for dimensionality reduction.
Principal Components Analysis (PCA) is a linear algebra technique for continuous attributes that finds new attributes (principal components) that
(1) are linear combinations of the original attributes,
(2) are orthogonal (perpendicular) to each other, and
(3) capture the maximum amount of variation in the data.
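A small sketch of these properties, assuming scikit-learn (the data is synthetic):

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    x1 = rng.normal(size=100)
    # Two strongly correlated attributes: most variation lies along one direction.
    X = np.column_stack([x1, 2 * x1 + rng.normal(scale=0.1, size=100)])

    pca = PCA(n_components=2).fit(X)
    print(pca.explained_variance_ratio_)        # first component captures ~all variation
    print(pca.components_ @ pca.components_.T)  # ~identity: components are orthogonal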
Feature subset selection is another way to reduce the dimensionality of the data.
Redundant features
◦ Duplicate much or all of the information contained
in one or more other attributes
◦ Example: purchase price of a product and the
amount of sales tax paid
Irrelevant features
◦ Contain no information that is useful for the data
mining task at hand
◦ Example: students' ID is often irrelevant to the task
of predicting students' GPA
Embedded approaches
Filter approaches
Wrapper approaches
Embedded approaches:
Feature selection occurs naturally as part of the data
mining algorithm.
Specifically, during the operation of the data mining
algorithm, the algorithm itself decides which attributes to
use and which to ignore.
Algorithms for building decision tree classifiers often
operate in this manner.
Filter approaches:
Features are selected before the data mining algorithm is
run, using some approach that is independent of the data
mining task.
For example, we might select sets of attributes whose
pairwise correlation is as low as possible.
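A sketch of such a correlation-based filter (NumPy; the 0.95 threshold and the synthetic data are illustrative):

    import numpy as np

    rng = np.random.default_rng(1)
    a = rng.normal(size=200)
    # Column 1 nearly duplicates column 0; column 2 is independent.
    X = np.column_stack([a, a + rng.normal(scale=0.01, size=200),
                         rng.normal(size=200)])

    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, k] < 0.95 for k in keep):  # keep only weakly correlated attrs
            keep.append(j)
    print(keep)  # [0, 2]: the near-duplicate column is filtered out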
Wrapper approaches:
These methods use the target data mining algorithm as a
black box to find the best subset of attributes, in a way
similar to that of the ideal algorithm described above, but
typically without enumerating all possible subsets.
It is possible to encompass both the filter and
wrapper approaches within a common architecture.
The feature selection process is viewed as consisting
of four parts: a measure for evaluating a subset, a
search strategy that controls the generation of a new
subset of features, a stopping criterion, and a
validation procedure.
Filter methods and wrapper methods differ only in
the way in which they evaluate a subset of features.
For a wrapper method, subset evaluation uses the
target data mining algorithm, while for a filter
approach, the evaluation technique is distinct from
the target data mining algorithm.
For the filter approach, evaluation measures
attempt to predict how well the actual data
mining algorithm will perform on a given set
of attributes.
For the wrapper approach, where evaluation
consists of actually running the target data
mining application, the subset evaluation
function is simply the criterion normally used
to measure the result of the data mining.
The stopping criterion is usually based on one or more
conditions involving the following: the number of
iterations, whether the value of the subset evaluation
measure is optimal or exceeds a certain threshold,
whether a subset of a certain size has been obtained,
whether simultaneous size and evaluation criteria have
been achieved, and whether any improvement can be
achieved by the options available to the search strategy.
Dissimilarity measure
◦ Numerical measure of how different two data objects are
◦ Lower when objects are more alike
◦ Minimum dissimilarity is often 0
◦ Upper limit varies
With r = 2, the Minkowski distance is the Euclidean distance:

d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}

Correlation captures only linear relationships. For example, for y_i = x_i^2 with x = (-3, -2, -1, 0, 1, 2, 3): mean(x) = 0, mean(y) = 4, std(x) = 2.16, std(y) = 3.74, yet the correlation between x and y is 0.
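Both quantities are easy to verify numerically (NumPy; the two points in the distance example are illustrative):

    import numpy as np

    x = np.array([-3, -2, -1, 0, 1, 2, 3])
    y = x ** 2
    print(x.mean(), y.mean())            # 0.0 4.0
    print(x.std(ddof=1), y.std(ddof=1))  # ~2.16 ~3.74
    print(np.corrcoef(x, y)[0, 1])       # 0.0: correlation misses the nonlinear link

    # Euclidean distance (Minkowski r = 2) between two points:
    p, q = np.array([0.0, 2.0]), np.array([3.0, 6.0])
    print(np.sqrt(((p - q) ** 2).sum())) # 5.0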
The entropy of a discrete variable X taking n values with probabilities p_i is

H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i

and the joint entropy of X and Y is

H(X, Y) = -\sum_{i} \sum_{j} p_{ij} \log_2 p_{ij}

where p_{ij} is the probability that the ith value of X and the jth value of Y occur together.
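Both formulas translate directly into code (NumPy; the distributions are illustrative):

    import numpy as np

    p = np.array([0.5, 0.25, 0.25])        # marginal distribution of X
    H_X = -(p * np.log2(p)).sum()
    print(H_X)                             # 1.5 bits

    p_xy = np.array([[0.25, 0.25],         # joint distribution p_ij
                     [0.25, 0.25]])
    H_XY = -(p_xy * np.log2(p_xy)).sum()
    print(H_XY)                            # 2.0 bits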