DM DW Module2

This document discusses data cube computation and OLAP techniques. It describes creating a data cube with dimensions for item, city, year, and sales, computing aggregates by grouping on different dimension combinations, materializing all or some of the cuboids, and techniques for query processing and improving performance.

 Efficient Data cube computation

 Data access methods


 Query processing techniques
 You would like to create a data cube All_Electronics that contains the following: item, city, year, and sales_in_Euro

 Answer the following queries:


◦ Compute the sum of sales, grouping by item and
city
◦ Compute the sum of sales, grouping by item
◦ Compute the sum of sales, grouping by city
◦ Compute the total sales (empty group-by: the 0-D apex cuboid)
 The total number of data cuboids is 2^3 = 8:
◦ {(city,item,year),
◦ (city,item), (city,year),(item,year)
◦ (city),(item),(year),
◦ ()}

 (): the group by is empty, i.e., the dimensions are not grouped
 These group-bys form a lattice of cuboids for the data cube

 The base cuboid contains all three dimensions


Example: a 2-D data cube over city and item (the base cuboid), with its aggregate margins:

City        I1   I2   I3   I4   I5   I6   All
New York    10   11   12    3   10    1    47
Chicago     11    9    6    9    6    7    48
Toronto     12    9    8    5    7    3    44
Vancouver   13    8   10    5    6    3    45
All         46   37   36   22   29   14   184

The inner cells are base cells of the base cuboid; the "All" column is the city cuboid and the "All" row is the item cuboid (both contain aggregate cells), while the grand total 184 corresponds to the apex cuboid.


 A cube operator on n dimensions is equivalent to a
collection of group-by statements, one for each
subset of the n dimensions.

 The compute cube operator computes aggregates


over all subsets of the dimensions specified in the
operation.

 Similar to the SQL syntax, the data cube could be


defined in DMQL as:
◦ define cube sales [item, city, year]: sum (sales_in_dollars)

 Compute the sales aggregate cuboids as:


◦ compute cube sales
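As a rough illustration only (Python with pandas, not DMQL), the effect of the compute cube operator can be sketched by running one group-by per subset of the dimensions; the toy table and its values below are assumptions made for the example.

from itertools import combinations
import pandas as pd

# Toy sales table (illustrative values only).
sales = pd.DataFrame({
    "item": ["TV", "TV", "Radio", "Radio"],
    "city": ["Rome", "Milan", "Rome", "Milan"],
    "year": [2023, 2023, 2024, 2024],
    "sales_in_dollars": [100, 150, 80, 60],
})

dims = ["item", "city", "year"]
cuboids = {}
for k in range(len(dims) + 1):               # subsets of size 0..3 -> 2^3 = 8 cuboids
    for subset in combinations(dims, k):
        if subset:
            cuboids[subset] = sales.groupby(list(subset))["sales_in_dollars"].sum()
        else:                                # empty group-by: the apex cuboid (grand total)
            cuboids[subset] = sales["sales_in_dollars"].sum()

print(len(cuboids))                          # 8
print(cuboids[("item", "city")])             # the (item, city) cuboid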
• Fast on-line analytical processing takes
minimum time if aggregates for all the cuboids
are precomputed.

 fast response time(enhances performance of OLAP)


 avoids redundant computations

 Pre-computation of the full cube requires an excessive amount of memory, and the cost depends on the number of dimensions and the concept hierarchy of each dimension.

 This is called the curse of dimensionality.


 How many cuboids are there in an n-dimensional data cube when each dimension has an associated concept hierarchy?

 Many dimensions may have hierarchies, for example time


 day < month < quarter < year

 The total number of cuboids that can be generated is

T = product over i = 1..n of (Li + 1)

 where Li is the number of levels for dimension i.
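For instance, assuming hypothetical hierarchies of 4, 3, and 4 levels for time, item, and location, the count can be checked in Python:

from math import prod

levels = {"time": 4, "item": 3, "location": 4}      # assumed numbers of hierarchy levels
total_cuboids = prod(L + 1 for L in levels.values())
print(total_cuboids)                                # (4+1) * (3+1) * (4+1) = 100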


 Data cube materialization/ pre-computation

◦ No materialization: Don't precompute any of the non-base cuboids. Leads to multidimensional aggregation on the fly and is slow.

◦ Full materialization: Precompute all the cuboids. Running queries will be very fast, but this requires huge amounts of memory.

◦ Partial Materialization: Selectively compute a proper


subset of the cuboids, which contains only those
cells that satisfy some user specified criterion.
 A popular approach is to materialize the
cuboids set on which other frequently
referenced cuboids are based.

 The partial materialization of cuboids or


subcubes should consider three factors:
(1) identify the subset of cuboids or subcubes to
materialize
(2) exploit the materialized cuboids or subcubes
during query processing
(3) efficiently update the materialized cuboids or
subcubes during load and refresh.
 Full cube (full materialization): All cells of all cuboids are materialized, i.e., every possible combination of dimensions and values. The number of cuboids is 2^n, or the product over i = 1..n of (Li + 1) when dimension i has Li hierarchy levels.

 Iceberg cube: Materialize only the cells in a cuboid whose measure value is above a minimum support threshold, i.e., the iceberg condition count(*) >= min_sup (see the sketch after this list).
 Closed cube: No ancestor cell is created if its measure is equal to that of its descendant cell.
 Shell cube: precomputing the cuboids for only a small
number of dimensions (e.g., three to five) of a data cube
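A minimal sketch of the iceberg condition for a single (city, item) cuboid, using pandas on an assumed toy table and an assumed min_sup threshold:

import pandas as pd

sales = pd.DataFrame({
    "city": ["Rome", "Rome", "Milan", "Milan", "Milan"],
    "item": ["TV", "TV", "TV", "Radio", "Radio"],
    "year": [2023, 2024, 2023, 2023, 2024],
})
min_sup = 2                                      # hypothetical minimum support threshold

counts = sales.groupby(["city", "item"]).size()  # count(*) per (city, item) cell
iceberg_cells = counts[counts >= min_sup]        # keep only cells satisfying the iceberg condition
print(iceberg_cells)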
 Bitmap Index is popular in OLAP products because it allows
quick searching in data cubes.

 The bitmap index is an alternative representation of the record


ID (RID) list.

 In the bitmap index for a given attribute, there is a distinct bit vector, Bv, for each value v in the attribute's domain.

 If a given attribute's domain consists of n values, then n bits are needed for each entry in the bitmap index (i.e., there are n bit vectors).

 If the attribute has the value v for a given row in the data table,
then the bit representing that value is set to 1 in the
corresponding row of the bitmap index. All other bits for that
row are set to 0.
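A minimal numpy sketch of a bitmap index for one attribute, with made-up data:

import numpy as np

city = np.array(["Rome", "Milan", "Rome", "Turin", "Milan"])   # hypothetical column of a data table

# One bit vector Bv per distinct value v: bit i is 1 iff row i has value v, 0 otherwise.
bitmap_index = {v: (city == v).astype(np.uint8) for v in np.unique(city)}
for value, bits in bitmap_index.items():
    print(value, bits)

print(np.nonzero(bitmap_index["Rome"])[0])       # rows whose city is "Rome" -> [0 2]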
 It is efficient compared to hash and tree
indices.

 It is useful for low cardinality domains


because the processing time is reduced.

 It leads to reductions in space and IO.

 It can be adapted for higher cardinality


domains using compression techniques.
 Traditional indexing maps the value in a given column to a
list of rows having that value.

 In contrast, join indexing registers the joinable rows of


two relations from a relational database.

 In data warehouses, join index relates the values of the


dimensions of a star schema to rows in the fact table.

 E.g. fact table: Sales and two dimensions city and product

 A join index on city maintains for each distinct city a list of


R-IDs of the tuples recording the Sales in the city.

 Join indices can span multiple dimensions, giving a composite join index.
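A rough Python sketch of a join index on city, using a hypothetical Sales fact table:

from collections import defaultdict

# Hypothetical fact table rows: (RID, city, product, sales_amount).
sales_fact = [
    (0, "Rome",  "TV",    100),
    (1, "Milan", "TV",    150),
    (2, "Rome",  "Radio",  80),
    (3, "Turin", "Radio",  60),
]

# Join index on city: each distinct city -> list of R-IDs of fact-table tuples for that city.
join_index_city = defaultdict(list)
for rid, city, product, amount in sales_fact:
    join_index_city[city].append(rid)

print(dict(join_index_city))                     # {'Rome': [0, 2], 'Milan': [1], 'Turin': [3]}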
1. Determine which operations should be performed on the
available cuboids
◦ e.g., dice = selection + projection
2. Determine which materialized cuboid(s) should be selected
for OLAP operation.
◦ the one with low cost

 Suppose that we define a data cube for AllElectronics of the


form “sales cube [time, item, location]: sum(sales in dollars).”
 The dimension hierarchies used are
 “day < month < quarter < year” for time;
 “item name < brand < type” for item;
 “street < city < province or state < country” for location.
Let the query to be processed be on
{brand, province_or_state} with the condition “year = 2004”,
and there are 4 materialized cuboids available:

1) {year, item_name, city}


2) {year, brand, country}
3) {year, brand, province_or_state}
4) {item_name, province_or_state} where year = 2004

Which should be selected to process the query?


 ROLAP works directly with relational databases.

 Has greater scalability than MOLAP.

 Uses a relational or extended-relational DBMS to store and


manage warehouse data.

 optimization for each DBMS, implementation of aggregation


navigation logic, and additional tools and services

 ROLAP tools do not use pre-calculated data cubes but instead


pose the query to the standard relational database.

 ROLAP tools feature the ability to ask any question, because the methodology is not limited to the contents of a cube.

 ROLAP also has the ability to drill down to the lowest level of
detail in the database.
 MOLAP stores this data in an optimized multi-
dimensional array storage.

 The advantage of using a data cube is that it allows


fast indexing to precomputed summarized data.

 MOLAP tools have a very fast response time.

 The data cube contains all the possible answers to a


given range of questions.

 MOLAP servers adopt a two-level storage


representation to handle dense and sparse data sets.

 Sparse subcubes employ compression technology for


efficient storage utilization
 HOLAP server may allow large volumes of detailed
data to be stored in a relational database

 Aggregations are kept in a separate MOLAP store.

 HOLAP tools can utilize both pre-calculated cubes


and relational data sources.

 The hybrid OLAP approach combines ROLAP and


MOLAP technology, benefiting from the greater
scalability of ROLAP and the faster computation of
MOLAP.

 The Microsoft SQL Server 2000 supports a hybrid


OLAP server.
 To meet the growing demand of OLAP
processing in relational databases, some
database system vendors implement
specialized SQL servers.

 They provide advanced query language and


query processing support for SQL queries
over star and snowflake schemas in a read-
only environment.
 Data is a set of facts/observations/measurements about objects/events/processes of interest.
 Information is processed data that is useful in one way or another (decision making, communication, etc.).

 While the data is fixed, the information derived from it can differ based on needs.
 Data mining is the process of automatically
discovering useful information in large data
repositories.

 Finding hidden information in a database.

 Data mining techniques are deployed to scour


large databases in order to find novel and useful
patterns that might otherwise remain unknown.

 They also provide capabilities to predict the


outcome of a future observation.
 Looking up individual records using a database
management system.

 finding particular Web pages via a query to an


Internet search engine.

 Above are tasks related to the area of information


retrieval.
Data mining is an integral part of knowledge discovery in databases (KDD), which is the overall process of converting raw data into useful information.
 The input data can be stored in a variety of
formats.

 The purpose of preprocessing is to transform the raw input data into an appropriate format for subsequent analysis.

 It includes:
 Fusing data from multiple sources
 cleaning data to remove noise and duplicate
observations
 selecting records and features that are
relevant to the data mining task at hand.
 Postprocessing ensures that only valid and useful results are incorporated into the DSS.

 Visualization, for example, allows analysts to explore the data and the data mining results from a variety of viewpoints.

 Statistical testing methods can also be applied during postprocessing to eliminate spurious data mining results.
 Scalability

 High Dimensionality

 Heterogeneous and Complex Data

 Data Ownership and Distribution

 Non-traditional Analysis
 Predictive tasks: Predict the value of a
particular attribute based on the values of
other attributes.
◦ Target or dependent variable
◦ Explanatory or independent variables.

 Descriptive tasks: Derive patterns that


summarize the underlying relationships in
data.
◦ Postprocessing techniques are used to validate and
explain the result.
Data Mining Tasks …

Example data set:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
 Refers to the task of building a model for the target
variable as the function of the explanatory
variables.

 There are two types of predictive modeling tasks:


 classification: which is used for discrete target
variables.
 regression, which is used for continuous target
variables.

 Examples: predicting whether a web user will make a purchase at an online bookstore; forecasting the future price of a stock.
Predictive Modeling: Classification
 Find a model for class attribute as a function
of the values of other attributes
Model for predicting credit worthiness

Training data:

Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Graduate            5                           Yes
2    Yes       High School         2                           No
3    No        Undergrad           1                           No
4    Yes       High School         10                          Yes
...  ...       ...                 ...                         ...

Figure: a decision tree whose nodes test Employed (Yes/No), Level of Education ({Graduate} vs. {High school, Undergrad}), and Number of years (> 3 yr / < 3 yr, > 7 yrs / < 7 yrs), with Yes/No leaves for credit worthiness.
 Given a collection of records (training set )

 Each record contains a set of attributes, one of the


attributes is the class.

 Find a model for class attribute as a function of the


values of other attributes.

 Goal: previously unseen records should be assigned a


class as accurately as possible.

 A test set is used to determine the accuracy of the


model.

 Usually, the given data set is divided into training and


test sets, with training set used to build the model and
test set used to validate it.
Training set:

Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Graduate            5                           Yes
2    Yes       High School         2                           No
3    No        Undergrad           1                           No
4    Yes       High School         10                          Yes
...  ...       ...                 ...                         ...

Test set:

Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Undergrad           7                           ?
2    No        Graduate            3                           ?
3    Yes       High School         2                           ?
...  ...       ...                 ...                         ...

The training set is used to learn a model (classifier), which is then applied to the test set.
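A minimal sketch of this train/test workflow on the credit-worthiness data above, using scikit-learn; the one-hot encoding and the decision tree model are illustrative choices, not prescribed by the slides.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

train = pd.DataFrame({
    "Employed":  ["Yes", "Yes", "No", "Yes"],
    "Education": ["Graduate", "High School", "Undergrad", "High School"],
    "Years":     [5, 2, 1, 10],
    "CreditWorthy": ["Yes", "No", "No", "Yes"],
})
test = pd.DataFrame({
    "Employed":  ["Yes", "No", "Yes"],
    "Education": ["Undergrad", "Graduate", "High School"],
    "Years":     [7, 3, 2],
})

# One-hot encode the categorical attributes, keeping test columns aligned with the training columns.
X_train = pd.get_dummies(train[["Employed", "Education", "Years"]])
y_train = train["CreditWorthy"]
X_test = pd.get_dummies(test).reindex(columns=X_train.columns, fill_value=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)   # learn the model
print(model.predict(X_test))                                           # classes for the unseen records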
Cluster Analysis
 Finding groups of objects such that the objects
in a group will be similar (or related) to one
another and different from (or unrelated to) the
objects in other groups
Intra-cluster distances are minimized, while inter-cluster distances are maximized.
Clustering has been used to group sets of related customers, find areas of the ocean that have a significant impact on the Earth's climate, and compress data.
 Given a set of records, each of which contains some number of items from a given collection,

 association analysis is used to discover patterns that describe strongly associated features in the data.

 The discovered patterns are typically


represented in the form of implication rules
or feature subsets
 Market-basket analysis
◦ Rules are used for sales promotion, shelf management, and inventory management

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
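A minimal Python check of the second rule on the transactions above; the support and confidence definitions are the standard ones, and the code is only an illustration.

transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support(itemset):
    # Fraction of transactions that contain every item of the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

antecedent, consequent = {"Diaper", "Milk"}, {"Beer"}
supp = support(antecedent | consequent)          # 2/5 = 0.4
conf = supp / support(antecedent)                # 0.4 / 0.6 = 0.67
print(supp, round(conf, 2))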

 finding groups of genes that have related functionality

 identifying Web pages that are accessed together

 understanding the relationships between different


elements of Earth's climate system.
 task of identifying
observations whose
characteristics are
significantly different from
the rest of the data.

 Such observations are known


as anomalies or outliers.

 The goal of an anomaly detection algorithm is to discover the real anomalies and avoid falsely labeling normal objects as anomalous.

Applications:
1. Credit Card Fraud Detection
2. Network Intrusion Detection
 A data set is a collection of data objects and their attributes.
 An attribute is a property or characteristic of an object
◦ Examples: eye color of a person, temperature, etc.
◦ Attribute is also known as variable, field, characteristic, dimension, or feature
 A collection of attributes describes an object
◦ Object is also known as record, point, case, sample, entity, or instance

Objects (rows) and attributes (columns):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
 An attribute is a property or characteristic of an
object that may vary; either from one object to
another or from one time to another.

 Ex: eye color varies from person to person,


while the temperature of an object varies over
time

 Note that eye color is a symbolic attribute with


a small number of possible values
{brown,black,blue, green,hazel, etc.}

 while temperature is a numerical attribute with


a potentially unlimited number of values.
 A measurement scale is a rule (function)
that associates a numerical or symbolic
value with an attribute of an object.

 The process of measurement is the


application of a measurement scale to
associate a value with a particular attribute of
a specific object
 The values used to represent an attribute may
have properties that are not properties of the
attribute itself, and vice versa

 Distinction between attributes and attribute


values
◦ Same attribute can be mapped to different attribute
values
 Example: height can be measured in feet or meters

◦ Different attributes can be mapped to the same set of


values
 Example: Attribute values for ID and age are integers
 But properties of attribute values can be different
 ID has no limit but age has a maximum and minimum value
 The way you measure an attribute may not match the attribute's properties.
Figure: line segments of different lengths (A, B, C, D, ...) mapped to two measurement scales. One scale preserves only the ordering property of length, while the other preserves both the ordering and additivity properties of length.
 The following properties (operations) of numbers are typically used to describe attributes.
◦ Distinctness: =, ≠
◦ Order: <, >
◦ Differences are meaningful: +, -
◦ Ratios are meaningful: *, /

◦ Nominal attribute: distinctness
◦ Ordinal attribute: distinctness & order
◦ Interval attribute: distinctness, order & meaningful differences
◦ Ratio attribute: all 4 properties/operations
 There are different types of attributes
◦ Nominal
 Examples: ID numbers, eye color, zip codes
◦ Ordinal
 Examples: rankings (e.g., place in competition),
grades, height {tall, medium, short}
◦ Interval
 Examples: calendar dates, temperatures in Celsius
or Fahrenheit.
◦ Ratio
 Examples: height, weight, length, time, counts
Attribute types, descriptions, examples, and permitted operations:

Categorical (Qualitative)
◦ Nominal: values only distinguish (=, ≠). Examples: zip codes, employee ID numbers, eye color, gender {male, female}. Operations: mode, entropy, contingency correlation, chi-square test.
◦ Ordinal: values also order objects (<, >). Examples: hardness of minerals, {good, better, best}, grades, street numbers. Operations: median, percentiles, rank correlation, run tests, sign tests.

Numeric (Quantitative)
◦ Interval: differences between values are meaningful (+, -). Examples: calendar dates, temperature in Celsius or Fahrenheit. Operations: mean, standard deviation, Pearson's correlation, t and F tests.
◦ Ratio: both differences and ratios are meaningful (*, /). Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, current. Operations: geometric mean, harmonic mean, percent variation.

Permissible transformations (transformations that do not change the attribute's meaning):

Categorical (Qualitative)
◦ Nominal: any permutation of values. (If all employee ID numbers were reassigned, would it make any difference?)
◦ Ordinal: an order-preserving change of values, i.e., new_value = f(old_value) where f is a monotonic function. (An attribute encompassing good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.)

Numeric (Quantitative)
◦ Interval: new_value = a * old_value + b, where a and b are constants. (Thus, the Fahrenheit and Celsius temperature scales differ in terms of where their zero value is and the size of a unit, i.e., a degree.)
◦ Ratio: new_value = a * old_value. (Length can be measured in meters or feet.)
 Discrete Attribute
◦ Has only a finite or countably infinite set of values
◦ Examples: zip codes, counts, or the set of words in a
collection of documents
◦ Often represented as integer variables.
◦ Note: binary attributes are a special case of discrete
attributes

 Continuous Attribute
◦ Has real numbers as attribute values
◦ Examples: temperature, height, or weight
◦ Practically, real values can only be measured and
represented using a finite number of digits.
◦ Continuous attributes are typically represented as
floating-point variables.
 Only presence (a non-zero attribute value) is
regarded as important.

 Consider a data set where each object is a


student and each attribute records whether or
not a student took a particular course at a
university.

 it is more meaningful and more efficient to focus


on the non-zero values.

 Binary attributes where only non-zero values are


important are called asymmetric binary
attributes
◦ Dimensionality (number of attributes)
 Too few dimensions may not lead to quality mining results
 High-dimensional data suffers from the curse of dimensionality
 An important motivation in preprocessing the data is dimensionality reduction.

◦ Sparsity
 only the non-zero values need to be stored and manipulated
which improves computation time and storage.

◦ Resolution
 Data can be obtained at different levels of resolution, and the properties of the data often differ at different resolutions.
 If the resolution is too fine, a pattern may not be visible; if it is too coarse, the pattern may disappear.
 Record
◦ Data Matrix
◦ Document Data
◦ Transaction Data
 Graph
◦ World Wide Web
◦ Molecular Structures
 Ordered
◦ Spatial Data
◦ Temporal Data
◦ Sequential Data
◦ Genetic Sequence Data
 Data that consists of a collection of records, each of which consists of a fixed set of attributes

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
 If data objects have the same fixed set of numeric
attributes, then the data objects can be thought of
as points in a multi-dimensional space, where each
dimension represents a distinct attribute

 Such data set can be represented by an m by n


matrix, where there are m rows, one for each
object, and n columns, one for each attribute
 A sparse data matrix is a special case of a
data matrix in which the attributes are of the
same type and are asymmetric

 only non-zero values are important.

 Transaction data is an example of a sparse data matrix that has only 0-1 entries.

 Another common example is document data.

 Each document becomes a 'term' vector
◦ Each term is a component (attribute) of the vector
◦ The value of each component is the number of times the corresponding term occurs in the document.

Document-term matrix (terms: team, coach, play, ball, score, game, win, lost, timeout, season):

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0
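A small sketch of how a term vector could be built for one toy document (the document text is made up):

from collections import Counter

terms = ["team", "coach", "play", "ball", "score", "game", "win", "lost", "timeout", "season"]
document = "team play game game score team win"   # hypothetical document text

counts = Counter(document.split())
term_vector = [counts[t] for t in terms]           # occurrences of each term in the document
print(term_vector)                                 # [2, 0, 1, 0, 1, 2, 1, 0, 0, 0]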
 A special type of record data, where
◦ Each record (transaction) involves a set of items.
◦ For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
 Relationships among data objects frequently convey important information. In such cases, the data is often represented as a graph.

 In particular, the data objects are mapped to nodes of the graph, while the relationships among objects are captured by the links between objects and link properties, such as direction and weight.
 Examples: a generic graph, a molecule (e.g., the benzene molecule, C6H6), and linked web pages.
 For some types of data, the attributes have
relationships that involve order in time or
space

 Different types of ordered data are


◦ Sequential Data
◦ Sequence Data
◦ Time Series Data
◦ Spatial Data
Examples: sequences of items/events; genetic sequence data such as:
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Figure: average monthly temperature of land and ocean (an example of spatio-temporal data).
 Record-oriented techniques can be applied to
non-record data by extracting features from
data objects and using these features to
create a record corresponding to each object.

 Ex- chemical structure data


 Data mining focuses on

(1) the detection and correction of data quality


Problems - Data cleaning

(2) the use of robust algorithms that can tolerate


poor data quality.
 There may be problems due to human error,
limitations of measuring devices, or flaws in
the data collection process.

 measurement error refers to any problem


resulting from the measurement process

 For continuous attributes, the numerical


difference of the measured and true value is
called the error.
 The term data collection error refers to errors such as:
1. missing values, or even entirely missing data objects
2. spurious or duplicate objects

 Examples of data quality problems:


◦ Noise and outliers
◦ Missing values
◦ Duplicate data
◦ Wrong data
 Noise is the random component of a
measurement error.
 It may involve the distortion of a value or
the addition of spurious objects.
 Data errors may be the result of a more
deterministic phenomenon, such as a streak
in the same place on a set of photographs.

 Such deterministic distortions of the data are


often referred to as artifacts.
 Precision: the closeness of repeated measurements (of the same quantity) to one another.

 Bias: a systematic variation of the measurements from the quantity being measured.

 Accuracy: the closeness of measurements to the true value of the quantity being measured.
 Outliers are data objects with characteristics
that are considerably different than most of
the other data objects in the data set

◦ Case 1: Outliers are


noise that interferes
with data analysis
◦ Case 2: Outliers are
the goal of our analysis
 Credit card fraud
 Intrusion detection
 Reasons for missing values
◦ Information is not collected
(e.g., people decline to give their age and weight)
◦ Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)

 Handling missing values


◦ Eliminate data objects or variables
◦ Estimate missing values
 Example: time series of temperature
◦ Ignore the missing value during analysis
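A minimal pandas sketch of these strategies on a made-up table with missing values:

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40], "income": [50000, 60000, np.nan]})

eliminated = df.dropna()                           # eliminate objects with missing values
estimated = df.fillna(df.mean(numeric_only=True))  # estimate missing values (here: column mean)
# Many pandas/scikit-learn routines can also simply ignore NaN values during analysis.
print(eliminated)
print(estimated)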
 Inconsistent
 Duplicate
 Data set may include data objects that are
duplicates, or almost duplicates of one
another
◦ Major issue when merging data from
heterogeneous sources

 Examples:
◦ Same person with multiple email addresses
◦ Two distinct persons with same name
 Timeliness

 Relevance

 Knowledge about the Data


 Aggregation
 Sampling
 Dimensionality reduction
 Feature subset selection
 Feature creation
 Discretization and binarization
 Variable transformation
 Combining two or more attributes (or objects) into a
single attribute (or object)

 Purpose
– Data reduction
 Reduce the number of attributes or objects
– Change of scale
 Cities aggregated into regions, states, countries, etc.
 Days aggregated into weeks, months, or years

– More “stable” data


 Aggregated data tends to have less variability
Figure: variation of precipitation in Australia; the standard deviation of average yearly precipitation is lower than the standard deviation of average monthly precipitation.
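A small pandas sketch of aggregation as a change of scale, using synthetic daily values (random numbers, not real precipitation data); the aggregated series show less variability:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
daily = pd.Series(rng.normal(50, 10, size=730),
                  index=pd.date_range("2022-01-01", periods=730, freq="D"))

monthly = daily.groupby([daily.index.year, daily.index.month]).mean()   # days -> months
yearly = daily.groupby(daily.index.year).mean()                         # days -> years
print(daily.std(), monthly.std(), yearly.std())    # standard deviation shrinks as we aggregate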
 Sampling is the main technique employed for
data reduction.
◦ It is often used for both the preliminary investigation
of the data and the final data analysis.

 Statisticians often sample because obtaining


the entire set of data of interest is too
expensive or time consuming.

 Sampling is typically used in data mining


because processing the entire set of data of
interest is too expensive or time consuming.
 The key principle for effective sampling is the
following:

◦ Using a sample will work almost as well as using


the entire data set, if the sample is representative

◦ A sample is representative if it has approximately


the same properties (of interest) as the original set
of data
 Simple Random Sampling
◦ There is an equal probability of selecting any
particular item
◦ Sampling without replacement
 As each item is selected, it is removed from the
population
◦ Sampling with replacement
 Objects are not removed from the population as
they are selected for the sample.
 In sampling with replacement, the same object
can be picked up more than once
 Stratified sampling
◦ Split the data into several partitions; then draw
random equal or proportional no. of samples from
each partition.
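A minimal pandas sketch of simple random sampling (with and without replacement) and proportional stratified sampling, on a made-up data frame:

import pandas as pd

df = pd.DataFrame({"x": range(100), "group": ["A"] * 50 + ["B"] * 30 + ["C"] * 20})

srs = df.sample(n=10, replace=False, random_state=1)       # simple random sampling without replacement
srs_repl = df.sample(n=10, replace=True, random_state=1)   # with replacement: an object may repeat

# Stratified sampling: draw a proportional number of samples from each partition (group).
stratified = df.groupby("group", group_keys=False).apply(lambda g: g.sample(frac=0.1, random_state=1))
print(stratified["group"].value_counts())                  # roughly 5 A, 3 B, 2 C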
 Sampling and loss of information.
 Determining the proper size of sample.
• Sample size plays a very important role.
• A large sample eliminates the advantage of sampling.
• With small samples, patterns may be missed and erroneous patterns may be detected.

Figure: the same data set drawn with 8000 points, 2000 points, and 500 points.

 What sample size is necessary to get at least one object from each of 10 equal-sized groups?
 The proper sample size can be difficult to
determine, so adaptive or progressive sampling
schemes are sometimes used.

 These approaches start with a small sample, and


then increase the sample size until a sample of
sufficient size has been obtained.

 This technique eliminates the need to determine


the correct sample size initially.

 it requires that there be a way to evaluate the


sample to judge if it is large enough.
 Data sets can have a large number of features.

 Dimensionality reduction is the transformation of data


from a high dimensional space to a low dimensional
space so that the low dimension retains some
meaningful properties of the original data.

 Purpose:
◦ Avoid curse of dimensionality
◦ Reduce amount of time and memory required by data
mining algorithms
◦ Allow data to be more easily visualized
◦ May help to eliminate irrelevant features or reduce noise
 The curse of dimensionality refers to the
phenomenon that many types of data analysis
become significantly harder as the dimensionality of
data increases.

 As the dimensionality of the data increases, the


data is becoming increasingly sparse in the space
that it occupies.

 This sparsity makes it difficult for many data mining algorithms to produce meaningful results.

 Techniques
 Principal Components Analysis (PCA)
 Singular Value Decomposition(SVD) is a linear algebra
technique that is related to PCA and is also commonly used
for dimensionality reduction.
 Principal Components Analysis (PCA) is a linear algebra technique for continuous attributes that finds new attributes (principal components) that
 (1) are linear combinations of the original attributes,
 (2) are orthogonal (perpendicular) to each other, and
 (3) capture the maximum amount of variation in the data.
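A minimal scikit-learn sketch of PCA on synthetic data (the array and the number of components are assumptions):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 4] = X[:, 0] + 0.1 * rng.normal(size=200)   # make one attribute nearly redundant

pca = PCA(n_components=2)                        # new attributes = linear combinations of the originals
X_reduced = pca.fit_transform(X)                 # orthogonal directions of maximum variance
print(X_reduced.shape)                           # (200, 2)
print(pca.explained_variance_ratio_)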
 Another way to reduce dimensionality of data
 Redundant features
◦ Duplicate much or all of the information contained
in one or more other attributes
◦ Example: purchase price of a product and the
amount of sales tax paid
 Irrelevant features
◦ Contain no information that is useful for the data
mining task at hand
◦ Example: students' ID is often irrelevant to the task
of predicting students' GPA
 Embedded approaches
 Filter approaches
 Wrapper approaches
Embedded approaches:
 Feature selection occurs naturally as part of the data
mining algorithm.
 Specifically, during the operation of the data mining
algorithm, the algorithm itself decides which attributes to
use and which to ignore.
 Algorithms for building decision tree classifiers often
operate in this manner.
Filter approaches:
 Features are selected before the data mining algorithm is
run, using some approach that is independent of the data
mining task.
 For example, we might select sets of attributes whose
pairwise correlation is as low as possible.
Wrapper approaches:
 These methods use the target data mining algorithm as a
black box to find the best subset of attributes, in a way
similar to that of the ideal algorithm described above, but
typically without enumerating all possible subsets
 It is possible to encompass both the filter and
wrapper approaches within a common architecture.
 The feature selection process is viewed as consisting
of four parts: a measure for evaluating a subset, a
search strategy that controls the generation of a new
subset of features, a stopping criterion, and a
validation procedure.
 Filter methods and wrapper methods differ only in
the way in which they evaluate a subset of features.
 For a wrapper method, subset evaluation uses the
target data mining algorithm, while for a filter
approach, the evaluation technique is distinct from
the target data mining algorithm.
 For the filter approach, evaluation measures
attempt to predict how well the actual data
mining algorithm will perform on a given set
of attributes.
 For the wrapper approach, where evaluation
consists of actually running the target data
mining application, the subset evaluation
function is simply the criterion normally used
to measure the result of the data mining.
 The stopping criterion is usually based on one or more
conditions involving the following: the number of
iterations, whether the value of the subset evaluation
measure is optimal or exceeds a certain threshold,
whether a subset of a certain size has been obtained,
whether simultaneous size and evaluation criteria have
been achieved, and whether any improvement can be
achieved by the options available to the search strategy.

 Finally, once a subset of features has been selected, the


results of the target data mining algorithm on the selected
subset should be validated.
 A straightforward evaluation approach is to run the
algorithm with the full set of features and compare the full
results to results obtained using the subset of features.
 Another validation approach is to use a number of
different feature selection algorithms to obtain subsets of
features and then compare the results of running the data
mining algorithm on each subset.
 Feature weighting is an alternative to keeping
or eliminating features.
 More important features are assigned a
higher weight, while less important features
are given a lower weight.
 These weights are sometimes assigned based
on domain knowledge about the relative
importance of features.
 Alternatively, they may be determined
automatically.
 It is frequently possible to create, from the
original attributes, a new set of attributes
that captures the important information in a
data set much more effectively.

 Furthermore, the number of new attributes


can be smaller than the number of original
attributes, allowing us to reap all the
previously described benefits of
dimensionality reduction.
 Feature Extraction

 The creation of a new set of features from the


original raw data is known as feature extraction.
 Consider a set of photographs, where each
photograph is to be classified according to whether
or not it contains a human face.
 The raw data is a set of pixels, and as such, is not
suitable for many types of classification algorithms.
 However, if the data is processed to provide higher-level features, such as the presence or absence of certain types of edges and areas that are highly correlated with the presence of human faces, then many more classification techniques can be applied to this problem.
 Mapping the Data to a New Space
 A totally different view of the data can reveal
important and interesting features.
 Consider, for example, time series data, which often
contains periodic patterns.
 If there is only a single periodic pattern and not much noise, then the pattern is easily detected.
 If, on the other hand, there are a number of periodic
patterns and a significant amount of noise is present,
then these patterns are hard to detect.
 Such patterns can, nonetheless, often be detected by
applying a Fourier transform to the time series in
order to change to a representation in which
frequency information is explicit
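A small numpy sketch: two periodic patterns buried in noise are hard to see in the time domain but stand out as peaks after a Fourier transform (the series is synthetic):

import numpy as np

t = np.arange(1000)
rng = np.random.default_rng(0)
series = np.sin(2 * np.pi * t / 50) + 0.5 * np.sin(2 * np.pi * t / 20) + rng.normal(0, 1, 1000)

spectrum = np.abs(np.fft.rfft(series))           # magnitude of each frequency component
freqs = np.fft.rfftfreq(len(series), d=1.0)
print(freqs[np.argsort(spectrum)[-2:]])          # strongest frequencies: near 1/20 = 0.05 and 1/50 = 0.02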
 Feature Construction
 Sometimes the features in the original data sets
have the necessary information, but it is not in a
form suitable for the data mining algorithm.
 In this situation, one or more new features
constructed out of the original features can be
more useful than the original features.
 For example, a density feature constructed from the mass and volume features, i.e., density = mass / volume, would most directly yield an accurate classification.
 Discretization is typically applied to attributes
that are used in classification or association
analysis
 Divide the range of continuous attributes into
intervals.
 Interval label can be used to replace the actual
data values.
 Reducing the data size.
 Can be used for supervised and unsupervised
learning.
 Merge(Bottom up) and Split(top down) -2
methods
 Binning
◦ Top down split, unsupervised
 Histogram analysis
 Top down split, unsupervised
 Clustering analysis
 Top down split or bottom up merge, unsupervised
 Decision tree analysis
 Top down split, supervised
 Correlation analysis
 Bottom up merge, unsupervised
 Equal width partitioning
◦ Divides the range into N intervals of equal size.
◦ If A and B are the highest and lowest values of the attribute, the width of the intervals will be W = (A - B) / N.
◦ Outliers may predominate.
 Equal frequency partitioning
◦ Divides the range into N intervals, each containing
approximately the same no. of objects.
◦ Good data scaling.
Figures: the equal interval width, equal frequency, and K-means approaches, each used to obtain 4 intervals.
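A minimal pandas sketch of equal-width versus equal-frequency partitioning into N = 4 intervals, on made-up values:

import pandas as pd

values = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

equal_width = pd.cut(values, bins=4)             # intervals of width W = (max - min) / 4
equal_freq = pd.qcut(values, q=4)                # intervals holding roughly the same number of objects
print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())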
 Classification(decision tree analysis)
◦ Supervised – given class labels ex: benign vs
malignant
◦ Using entropy to determine split point.
◦ Top down –recursive split.
 Correlation analysis( chi square analysis)
◦ Supervised – use class information.
◦ Bottom up merge – find the best neighboring
intervals to merge.
◦ Use predefined condition to stop the recursive
merge.
 A simple technique to binarize a categorical attribute is
the following:
 If there are m categorical values, then uniquely assign
each original value to an integer in the interval [0,m - 1].
 If the attribute is ordinal, then order must be maintained
by the assignment.
 Next, convert each of these m integers to a binary number. Since n = ceil(log2(m)) binary digits are required to represent these integers, represent these binary numbers using n binary attributes.
 To illustrate, a categorical variable with 5 values {awful,
poor, OK, good, great} would require three binary
variables.
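A small Python sketch of this binarization for the five ordinal values (the encoding order follows the value order given above):

import math

values = ["awful", "poor", "OK", "good", "great"]        # m = 5 ordinal values
n_bits = math.ceil(math.log2(len(values)))               # 3 binary attributes are enough

# Assign each value an integer in [0, m-1] preserving order, then write it with n binary digits.
encoding = {v: format(i, f"0{n_bits}b") for i, v in enumerate(values)}
print(encoding)   # {'awful': '000', 'poor': '001', 'OK': '010', 'good': '011', 'great': '100'}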
 Similarity measure
◦ Numerical measure of how alike two data objects are.
◦ Is higher when objects are more alike.
◦ Often falls in the range [0,1]

 Dissimilarity measure
◦ Numerical measure of how different two data objects are
◦ Lower when objects are more alike
◦ Minimum dissimilarity is often 0
◦ Upper limit varies

 Proximity refers to a similarity or dissimilarity


The standard definitions of similarity and dissimilarity between two objects, x and y, with respect to a single, simple attribute are:

Nominal: d = 0 if x = y, 1 if x ≠ y; s = 1 if x = y, 0 if x ≠ y
Ordinal: d = |x - y| / (n - 1), with the values mapped to integers 0 to n-1; s = 1 - d
Interval or Ratio: d = |x - y|; s = -d, s = 1 / (1 + d), or a min-max normalized version of -d
 Euclidean Distance

d(x, y) = sqrt( sum over k = 1..n of (xk - yk)^2 )

where n is the number of dimensions (attributes) and xk and yk are, respectively, the kth attributes (components) of data objects x and y.
 Minkowski Distance is a generalization of Euclidean Distance:

d(x, y) = ( sum over k = 1..n of |xk - yk|^r )^(1/r)

where r is a parameter, n is the number of dimensions (attributes) and xk and yk are, respectively, the kth attributes (components) of data objects x and y.
Example points in 2-D:

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1
 r = 1. City block (Manhattan, taxicab, L1 norm)
distance.
 ex: hamming distance

 r = 2. Euclidean distance

 r → ∞: "supremum" (Lmax norm, L∞ norm) distance.
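A quick numpy check of the three cases for points p1 = (0, 2) and p4 = (5, 1) from the table above:

import numpy as np

x, y = np.array([0, 2]), np.array([5, 1])

l1 = np.sum(np.abs(x - y))                       # r = 1: city block (Manhattan) distance -> 6
l2 = np.sqrt(np.sum((x - y) ** 2))               # r = 2: Euclidean distance -> sqrt(26), about 5.10
linf = np.max(np.abs(x - y))                     # r -> infinity: supremum distance -> 5
print(l1, l2, linf)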


 Distances, such as the Euclidean distance,
have some well known properties.
1. d(x, y) ≥ 0 for all x and y, and d(x, y) = 0 only if x = y. (Positive definiteness)
2. d(x, y) = d(y, x) for all x and y. (Symmetry)
3. d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z. (Triangle Inequality)
where d(x, y) is the distance (dissimilarity)
between points (data objects), x and y.

 A distance that satisfies these properties is


a metric
 Non-metric Dissimilarities: Set Differences
◦ d(A, B) = size(A - B) + size(B - A)

 Non-metric Dissimilarities: Time


 Similarities, also have some well known
properties.
1. s(x, y) = 1 (or maximum similarity) only if x = y. (0 ≤ s ≤ 1)
2. s(x, y) = s(y, x) for all x and y. (Symmetry)

where s(x, y) is the similarity between points


(data objects), x and y.
 Similarity measures between objects that contain only
binary attributes are called similarity coefficients, and
typically have values between 0 and 1.

Let x and y be two objects that consist of n binary attributes, and let
f01 = the number of attributes where x is 0 and y is 1,
f10 = the number of attributes where x is 1 and y is 0,
f00 = the number of attributes where x is 0 and y is 0,
f11 = the number of attributes where x is 1 and y is 1.

Simple Matching Coefficient:
SMC = (number of matching attribute values) / (number of attributes) = (f11 + f00) / (f01 + f10 + f11 + f00)

 Suppose that x and y are data objects that represent two rows (two transactions) of a transaction matrix.
 If each asymmetric binary attribute corresponds to an item in a store, then a 1 indicates that the item was purchased, while a 0 indicates that the product was not purchased.
 The Jaccard coefficient handles such asymmetric binary attributes:
J = (number of matching presences) / (number of attributes not involved in 00 matches) = f11 / (f01 + f10 + f11)

 The extended Jaccard (Tanimoto) coefficient is a variation of Jaccard for continuous or count attributes
◦ Reduces to Jaccard for binary attributes
Figure: scatter plots showing data sets with correlation (similarity) values ranging from -1 to 1.
 x = (-3, -2, -1, 0, 1, 2, 3)
 y = (9, 4, 1, 0, 1, 4, 9)

yi = xi 2

 mean(x) = 0, mean(y) = 4
 std(x) = 2.16, std(y) = 3.74

 corr = [(-3)(5) + (-2)(0) + (-1)(-3) + (0)(-4) + (1)(-3) + (2)(0) + (3)(5)] / (6 * 2.16 * 3.74) = 0

 The correlation is 0 even though y is completely determined by x, because correlation only measures linear relationships.
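The same computation checked with numpy:

import numpy as np

x = np.array([-3, -2, -1, 0, 1, 2, 3])
y = x ** 2                                       # y is completely determined by x, but not linearly

print(np.corrcoef(x, y)[0, 1])                   # 0.0: Pearson correlation only detects linear relationships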
 The Mahalanobis distance is a generalization of Euclidean distance
 It is useful when attributes are correlated, have different ranges of values (different variances), and the distribution of the data is approximately Gaussian
 Domain of application
◦ Similarity measures tend to be specific to the type of
attribute and data
◦ Record data, images, graphs, sequences, 3D-protein
structure, etc. tend to have different measures

 However, one can talk about various properties


that you would like a proximity measure to have
◦ Symmetry is a common one
◦ Tolerance to noise and outliers is another
◦ Ability to find more types of patterns?
◦ Many others possible

 The measure must be applicable to the data and


produce results that agree with domain
knowledge
 For
◦ a variable (event), X,
◦ with n possible values (outcomes), x1, x2 …, xn
◦ each outcome having probability, p1, p2 …, pn
◦ the entropy of X, H(X), is given by

H(X) = - sum over i = 1..n of pi log2(pi)

 Entropy is between 0 and log2(n) and is measured in bits
◦ Thus, entropy is a measure of how many bits it takes to
represent an observation of X on average
 For a coin with probability p of heads and
probability q = 1 – p of tails
H = -p log2(p) - q log2(q)
◦ For p= 0.5, q = 0.5 (fair coin) H = 1
◦ For p = 1 or q = 1, H = 0

 What is the entropy of a fair four-sided die?


Hair Color  Count  p     -p log2(p)
Black       75     0.75  0.3113
Brown       15     0.15  0.4105
Blond       5      0.05  0.2161
Red         0      0.00  0
Other       5      0.05  0.2161
Total       100    1.00  1.1540

Maximum entropy is log2(5) = 2.3219
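A short Python check of the hair-color entropy above:

import math

counts = {"Black": 75, "Brown": 15, "Blond": 5, "Red": 0, "Other": 5}
m = sum(counts.values())

H = -sum((c / m) * math.log2(c / m) for c in counts.values() if c > 0)
print(round(H, 4))                               # 1.154 bits (the maximum would be log2(5) = 2.3219)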


 Suppose we have
◦ a number of observations (m) of some attribute, X,
e.g., the hair color of students in the class,
◦ where there are n different possible values
◦ And the number of observation in the ith category is
mi
◦ Then, for this sample

H(X) = - sum over i = 1..n of (mi / m) log2(mi / m)

 For continuous data, the calculation is harder


 Information one variable provides about another

Formally, I(X, Y) = H(X) + H(Y) - H(X, Y), where H(X, Y) is the joint entropy of X and Y:

H(X, Y) = - sum over i and j of pij log2(pij)

where pij is the probability that the ith value of X and the jth value of Y occur together

 For discrete variables, this is easy to compute

 Maximum mutual information for discrete variables is log2(min(nX, nY)), where nX (nY) is the number of values of X (Y)
Example: mutual information of Student Status and Grade.

Student Status  Count  p     -p log2(p)
Undergrad       45     0.45  0.5184
Grad            55     0.55  0.4744
Total           100    1.00  0.9928

Grade  Count  p     -p log2(p)
A      35     0.35  0.5301
B      50     0.50  0.5000
C      15     0.15  0.4105
Total  100    1.00  1.4406

Student Status  Grade  Count  p     -p log2(p)
Undergrad       A      5      0.05  0.2161
Undergrad       B      30     0.30  0.5211
Undergrad       C      10     0.10  0.3322
Grad            A      30     0.30  0.5211
Grad            B      20     0.20  0.4644
Grad            C      5      0.05  0.2161
Total                  100    1.00  2.2710

Mutual information of Student Status and Grade = 0.9928 + 1.4406 - 2.2710 = 0.1624
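The same result checked with numpy from the joint counts:

import numpy as np

# Joint probabilities: rows = Student Status (Undergrad, Grad), columns = Grade (A, B, C).
joint = np.array([[5, 30, 10],
                  [30, 20, 5]]) / 100.0

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

H_status = entropy(joint.sum(axis=1))            # 0.9928
H_grade = entropy(joint.sum(axis=0))             # 1.4406
H_joint = entropy(joint.flatten())               # 2.2710
print(round(H_status + H_grade - H_joint, 3))    # mutual information, about 0.162 (0.1624 above)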
 The Maximal Information Coefficient (MIC) applies mutual information to two continuous variables

 Consider the possible binnings of the variables


into discrete categories
◦ nX × nY ≤ N0.6 where
 nX is the number of values of X
 nY is the number of values of Y
 N is the number of samples (observations, data objects)

 Compute the mutual information


◦ Normalized by log2(min(nX, nY))

 Take the highest value
