
Data Mining and Data Warehousing (18CS641) Module 2 - Data warehouse implementation & Data mining

MODULE 2
Data warehouse implementation & Data mining

Data Warehouse Implementation:


Data warehouses contain huge volumes of data. OLAP servers demand that decision support
queries be answered in the order of seconds. Therefore, it is crucial for data warehouse systems
to support highly efficient cube computation techniques, access methods, and query processing
techniques.
Efficient Data Cube Computation: An Overview:
At the core of multidimensional data analysis is the efficient computation of aggregations across
many sets of dimensions. In SQL terms, these aggregations are referred to as group-by’s. Each
group-by can be represented by a cuboid, where the set of group-by’s forms a lattice of cuboids
defining a data cube.
The compute cube Operator and the Curse of Dimensionality:
• Data cube is a lattice of cuboids. Suppose that you want to create a data cube for AllElectronics
sales that contains the following: city, item, year, and sales in dollars. You want to be able to
analyze the data, with queries such as the following:
o “Compute the sum of sales, grouping by city and item.”
o “Compute the sum of sales, grouping by city.”
o “Compute the sum of sales, grouping by item.”
• What is the total number of cuboids, or group-by's, that can be computed for this data cube? Taking the
three attributes, city, item, and year, as the dimensions for the data cube, and sales in dollars
as the measure, the total number of cuboids, or group-by's, that can be computed for this data
cube is 2^3 = 8.
• The possible group-by’s are the following: {(city, item, year), (city, item), (city, year), (item,
year), (city), (item), (year), ()}, where () means that the group-by is empty (i.e., the dimensions
are not grouped). These group-by’s form a lattice of cuboids for the data cube, as shown in the
figure below.

Dept of CSE, Vemana I.T Page 1 of 47


Data Mining and Data Warehousing (18CS641) Module 2- Data warehouse implementation& Data mining

• The base cuboid contains all three dimensions, city, item, and year. It can return the total sales
for any combination of the three dimensions. The apex cuboid, or 0-D cuboid, refers to the case
where the group-by is empty. It contains the total sum of all sales.
• The base cuboid is the least generalized (most specific) of the cuboids. The apex cuboid is the
most generalized (least specific) of the cuboids, and is often denoted as all.
• An SQL query containing no group-by (e.g., "compute the sum of total sales") is a zero-dimensional
operation. An SQL query containing one group-by (e.g., "compute the sum of sales,
group by city") is a one-dimensional operation. A cube operator on n dimensions is equivalent
to a collection of group-by statements, one for each subset of the n dimensions. Therefore, the
cube operator is the n-dimensional generalization of the group-by operator. Similar to SQL
syntax, the data cube in the example above could be defined as:
define cube sales_cube [city, item, year]: sum(sales_in_dollars)
• For a cube with n dimensions, there are a total of 2^n cuboids, including the base cuboid. A
statement such as compute cube sales_cube would explicitly instruct the system to compute
the sales aggregate cuboids for all eight subsets of the set {city, item, year}, including the empty
subset.
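For illustration only, the following Python sketch enumerates all 2^n group-by's of a small, hypothetical sales table over the dimensions city, item, and year and computes sum(sales) for each cuboid (a toy computation, not an optimized cube algorithm):

    from itertools import combinations
    from collections import defaultdict

    # Hypothetical sales records: (city, item, year, dollars_sold)
    sales = [
        ("Vancouver", "TV", 2023, 1200.0),
        ("Vancouver", "phone", 2023, 800.0),
        ("Chicago", "TV", 2024, 1500.0),
    ]
    dims = ("city", "item", "year")

    # Enumerate all 2^3 = 8 subsets of the dimensions, from the base cuboid down to the apex ().
    for k in range(len(dims), -1, -1):
        for group_by in combinations(range(len(dims)), k):
            cuboid = defaultdict(float)
            for row in sales:
                key = tuple(row[i] for i in group_by)   # values of the grouped dimensions
                cuboid[key] += row[3]                   # the measure: sum of dollars_sold
            names = tuple(dims[i] for i in group_by) or ("()",)
            print(names, dict(cuboid))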


Curse of Dimensionality:
• A major challenge related to this precomputation, however, is that the required storage space
may explode if all the cuboids in a data cube are precomputed, especially when the cube has
many dimensions.
• The storage requirements are even more excessive when many of the dimensions have
associated concept hierarchies, each with multiple levels. This problem is referred to as the
curse of dimensionality.
Example: time is usually explored not at only one conceptual level (e.g., year), but rather at multiple
conceptual levels such as in the hierarchy “day <month < quarter < year.” For an n-dimensional
data cube, the total number of cuboids that can be generated (including the cuboids generated by
climbing up the hierarchies along each dimension) is

Total number of cuboids = (L1 + 1) × (L2 + 1) × ... × (Ln + 1)

where Li is the number of levels associated with dimension i. One is added to Li to include the
virtual top level, all.
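As a quick illustration of this formula, the short Python sketch below computes the cuboid count for a hypothetical 10-dimensional cube in which each dimension has 4 or 5 hierarchy levels:

    from math import prod

    # Li = number of hierarchy levels for dimension i (excluding the virtual top level "all")
    levels = [4, 4, 4, 4, 4, 5, 5, 5, 5, 5]      # hypothetical 10-dimensional cube

    total_cuboids = prod(L + 1 for L in levels)  # product of (Li + 1) over all dimensions
    print(total_cuboids)                         # 5^5 * 6^5 = 24,300,000 cuboids

Even this modest cube would require millions of cuboids under full materialization, which is why partial materialization (discussed next) is usually preferred.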

Partial Materialization: Selected Computation of Cuboids:


There are three choices for data cube materialization given a base cuboid:
1. No materialization: Do not precompute any of the “nonbase” cuboids. This leads to computing
expensive multidimensional aggregates on-the-fly, which can be extremely slow.
2. Full materialization: Precompute all of the cuboids. The resulting lattice of computed cuboids
is referred to as the full cube. This choice typically requires huge amounts of memory space in
order to store all of the precomputed cuboids.
3. Partial materialization: Selectively compute a proper subset of the whole set of possible
cuboids. Alternatively, we may compute a subset of the cube, which contains only those cells that satisfy some user-specified criterion, such as where the tuple count of each cell is above some threshold. We will use the term subcube to refer to the latter case, where only some of the cells
may be precomputed for various cuboids. Partial materialization represents an interesting trade-
off between storage space and response time.
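The subcube idea just described can be sketched in a few lines of Python: only cells of the base cuboid whose tuple count meets a user-specified threshold are kept (a simplified, hypothetical illustration, not a full iceberg-cube algorithm):

    from collections import Counter

    # Hypothetical fact rows keyed by (city, item, year)
    facts = [("Chicago", "TV", 2024), ("Chicago", "TV", 2024),
             ("Paris", "phone", 2024), ("Chicago", "phone", 2023)]

    min_count = 2    # user-specified threshold on the tuple count of each cell

    counts = Counter(facts)                      # count() measure for each base-cuboid cell
    subcube = {cell: c for cell, c in counts.items() if c >= min_count}
    print(subcube)                               # {('Chicago', 'TV', 2024): 2}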


• The partial materialization of cuboids or subcubes should consider three factors:


(1) identify the subset of cuboids or subcubes to materialize;
(2) exploit the materialized cuboids or subcubes during query processing; and
(3) efficiently update the materialized cuboids or subcubes during load and refresh.

• The selection of the subset of cuboids or subcubes to materialize should take into account the
queries in the workload, their frequencies, and their accessing costs. In addition, it should consider
workload characteristics, the cost for incremental updates, and the total storage requirements.
• A popular approach is to materialize the set of cuboids on which other frequently referenced
cuboids are based. Alternatively, we can compute an iceberg cube, which is a data cube that
stores only those cube cells with an aggregate value (e.g., count) that is above some minimum
support threshold.
• Another common strategy is to materialize a shell cube. This involves precomputing the
cuboids for only a small number of dimensions (e.g., three to five) of a data cube.
• Once the selected cuboids have been materialized, it is important to take advantage of them
during query processing. This involves several issues, such as
o how to determine the relevant cuboid(s) from among the candidate materialized
cuboids,
o how to use available index structures on the materialized cuboids, and
o how to transform the OLAP operations onto the selected cuboid(s).
Finally, during load and refresh, the materialized cuboids should be updated efficiently.

Indexing OLAP Data: Bitmap Index and Join Index


The bitmap indexing method is popular in OLAP products because it allows quick searching
in data cubes. The bitmap index is an alternative representation of the record ID (RID) list. In the
bitmap index for a given attribute, there is a distinct bit vector, Bv, for each value v in the attribute’s
domain. If a given attribute’s domain consists of n values, then n bits are needed for each entry in
the bitmap index (i.e., there are n bit vectors). If the attribute has the value v for a given row in the
data table, then the bit representing that value is set to 1 in the corresponding row of the bitmap
index. All other bits for that row are set to 0.
• Index on a particular column


• Each value in the column has a bit vector: bit operations are fast
• The length of each bit vector: # of records in the base table
• The i-th bit is set if the i-th row of the base table has that value for the indexed column
• Not suitable for high-cardinality domains
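A minimal sketch of a bitmap index on one column, in Python (the table and values are hypothetical; real systems store compressed bit vectors):

    # Column values of the indexed attribute, one per row of the base table
    city = ["Vancouver", "Chicago", "Vancouver", "Paris"]

    # One bit vector per distinct value; bit i is 1 if row i has that value
    bitmap = {v: [1 if c == v else 0 for c in city] for v in sorted(set(city))}
    print(bitmap)
    # {'Chicago': [0, 1, 0, 0], 'Paris': [0, 0, 0, 1], 'Vancouver': [1, 0, 1, 0]}

    # A selection such as city = 'Vancouver' OR city = 'Paris' becomes a fast bit-wise OR
    rows = [a | b for a, b in zip(bitmap["Vancouver"], bitmap["Paris"])]
    print(rows)    # [1, 0, 1, 1] -> rows 0, 2 and 3 qualify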

The join indexing method gained popularity from its use in relational database query
processing. Traditional indexing maps the value in a given column to a list of rows having that
value. In contrast, join indexing registers the joinable rows of two relations from a relational
database.
• In data warehouses, a join index relates the values of the dimensions of a star schema to
rows in the fact table.
o E.g. fact table: Sales and two dimensions city and product
▪ A join index on city maintains for each distinct city a list of R-IDs of the tuples
recording the Sales in the city
o Join indices can span multiple dimensions
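A small sketch of such a join index in Python, assuming a hypothetical fact table keyed by record ID (RID):

    # Hypothetical fact table: RID -> (city, product, dollars_sold)
    fact = {
        "T57":  ("Main Street", "Sony-TV", 1200.0),
        "T238": ("Main Street", "Sony-TV", 900.0),
        "T884": ("Lakeshore",   "Sony-TV", 1500.0),
    }

    # Join index on the city dimension: each distinct city -> list of fact-table RIDs
    join_index_city = {}
    for rid, (city, product, sales) in fact.items():
        join_index_city.setdefault(city, []).append(rid)

    print(join_index_city)
    # {'Main Street': ['T57', 'T238'], 'Lakeshore': ['T884']}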


Efficient Processing of OLAP Queries


The purpose of materializing cuboids and constructing OLAP index structures is to speed up query
processing in data cubes. Given materialized views, query processing should proceed as follows:


1. Determine which operations should be performed on the available cuboids: This involves
transforming any selection, projection, roll-up (group-by), and drill-down operations specified in
the query into corresponding SQL and/or OLAP operations.
For example, slicing and dicing a data cube may correspond to selection and/or projection
operations on a materialized cuboid.
2. Determine to which materialized cuboid(s) the relevant operations should be applied:
This involves identifying all of the materialized cuboids that may potentially be used to answer the
query, pruning the set using knowledge of “dominance” relationships among the cuboids,
estimating the costs of using the remaining materialized cuboids, and selecting the cuboid with
the least cost.

Suppose, for example, that a query groups sales by brand and province_or_state, and that four cuboids, materialized at different levels of the item and location concept hierarchies, are available as candidates for answering it.

“Which of these four cuboids should be selected to process the query?”

Finer-granularity data cannot be generated from coarser-granularity data. Therefore, cuboid 2
cannot be used because country is a more general concept than province or state. Cuboids 1, 3, and
4 can be used to process the query.
“How would the costs of each cuboid compare if used to process the query?”
▪ Cuboid 1 would cost the most because both item name and city are at a lower level than the
brand and province or state concepts specified in the query.

▪ Cuboid 3 will be smaller than cuboid 4, and thus cuboid 3 should be chosen to process the
query. However, if efficient indices are available for cuboid 4, then cuboid 4 may be a better
choice.
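The pruning step can be sketched as follows. The concept hierarchies, cuboid definitions, and query are hypothetical stand-ins for the example above; a cuboid is usable only if each of its grouped concepts is at the same level as, or finer than, the corresponding concept in the query:

    # Hypothetical concept hierarchies, ordered from most specific to most general
    location = ["city", "province_or_state", "country"]
    product  = ["item_name", "brand", "category"]

    def level(hierarchy, concept):
        return hierarchy.index(concept)

    def can_answer(cuboid, query):
        # usable only if every grouped concept is at an equal or finer level than the query's
        loc_c, prod_c = cuboid
        loc_q, prod_q = query
        return (level(location, loc_c) <= level(location, loc_q) and
                level(product, prod_c) <= level(product, prod_q))

    query = ("province_or_state", "brand")
    candidates = {
        1: ("city", "item_name"),
        2: ("country", "brand"),
        3: ("province_or_state", "brand"),
        4: ("province_or_state", "item_name"),
    }

    usable = {k: v for k, v in candidates.items() if can_answer(v, query)}
    print(sorted(usable))   # [1, 3, 4]: cuboid 2 is pruned because country is coarser

Choosing among the usable cuboids would then proceed by estimating the cost of each, as described above.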


OLAP Server Architectures: ROLAP versus MOLAP versus HOLAP


1. Relational OLAP (ROLAP):
▪ ROLAP works directly with relational databases. The base data and the dimension tables
are stored as relational tables and new tables are created to hold the aggregated
information. It depends on a specialized schema design.
▪ This methodology relies on manipulating the data stored in the relational database to give
the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action
of slicing and dicing is equivalent to adding a "WHERE" clause in the SQL statement.
▪ ROLAP tools do not use pre-calculated data cubes but instead pose the query to the
standard relational database and its tables in order to bring back the data required to
answer the question.
▪ ROLAP tools feature the ability to ask any question because the methodology is not limited
to the contents of a cube. ROLAP also has the ability to drill down to the lowest level of
detail in the database.
2. Multidimensional OLAP (MOLAP):
▪ MOLAP is the 'classic' form of OLAP and is sometimes referred to as just OLAP.
▪ MOLAP stores data in optimized multidimensional array storage, rather than in a
relational database. It therefore requires the pre-computation and storage of information
in the cube - an operation known as processing.
▪ MOLAP tools generally utilize a pre-calculated data set referred to as a data cube.
▪ The data cube contains all the possible answers to a given range of questions.
▪ MOLAP tools have a very fast response time and the ability to quickly write back data into
the data set.
3. Hybrid OLAP (HOLAP):
▪ There is no clear agreement across the industry as to what constitutes Hybrid OLAP, except
that a database will divide data between relational and specialized storage.
▪ For example, for some vendors, a HOLAP database will use relational tables to hold the
larger quantities of detailed data, and use specialized storage for at least some aspects of
the smaller quantities of more-aggregate or less-detailed data.
▪ HOLAP addresses the shortcomings of MOLAP and ROLAP by combining the capabilities of
both approaches.
▪ HOLAP tools can utilize both pre-calculated cubes and relational data sources.


Data Mining:
Data mining is a technology that blends traditional data analysis methods with
sophisticated algorithms for processing large volumes of data.
• It has also opened up opportunities for exploring and analyzing new types of data
and for analyzing existing types of data in new ways.
• Data mining is the process of automatically discovering useful information in large data
repositories.
Applications of data mining

• Banking: loan/credit card approval – Given a database of 100,000 names, which
persons are the least likely to default on their credit cards?
• Customer relationship management: – Which of my customers are likely to be the
most loyal, and which are most likely to leave for a competitor?
• Targeted marketing: – identify likely responders to promotions
• Fraud detection: – telecommunications, financial transactions: from an online stream of
events, identify fraudulent events
• Manufacturing and production: – automatically adjust knobs when process parameters
change


• Medicine: disease outcome, effectiveness of treatments – analyze patient disease
history; find relationships between diseases
• Molecular/Pharmaceutical: identify new drugs
• Scientific data analysis: – identify new galaxies by searching for subclusters
• Web site/store design and promotion: – find the affinity of visitors to pages and modify
the layout
What is not Data Mining?

• Find all credit applicants with last name of Smith.


• Identify customers who have purchased more than $10,000 in the last month.
• Find all customers who have purchased milk
• Looking up individual records using a database management system
• Look up phone number in phone directory
• finding particular Web pages via a query to an Internet search engine

What is Data Mining?

• Find all credit applicants who are poor credit risks. (classification)
• Identify customers with similar buying habits. (Clustering)
• Find all items which are frequently purchased with milk. (association rules)
• Discover groups of similar documents on the Web
• Certain names are more popular in certain locations

Data Mining and Knowledge Discovery:


Data mining is an integral part of knowledge discovery in databases (KDD), which is the overall
process of converting raw data into useful information, as shown in Figure 1. This process
consists of a series of transformation steps, from data preprocessing to postprocessing of data
mining results.


• The input data can be stored in a variety of formats (flat files, spreadsheets, or relational
tables) and may reside in a centralized data repository or be distributed across multiple
sites.
• The purpose of preprocessing is to transform the raw input data into an appropriate format
for subsequent analysis.
• The steps involved in data preprocessing include
▪ fusing data from multiple sources,
▪ cleaning data to remove noise and duplicate observations, and
▪ selecting records and features that are relevant to the data mining task; this
is often the most time-consuming step in the overall knowledge discovery process.
• Postprocessing step that ensures that only valid and useful results are incorporated into
the decision support system.

Motivating Challenges:
• Scalability: Because of advances in data generation and collection, data sets with sizes of gigabytes,
terabytes, or even petabytes are becoming common. If data mining algorithms are to handle
these massive data sets, then they must be scalable.
• High Dimensionality: It is now common to encounter data sets with hundreds or
thousands of attributes. Data sets with temporal or spatial components also tend to have
high dimensionality; for example, consider a data set that contains measurements of
temperature taken repeatedly at various locations. The computational complexity of many
data analysis algorithms increases rapidly as the dimensionality (the number of features) increases.


• Heterogeneous and Complex Data: Traditional data analysis methods often deal with
data sets containing attributes of the same type, either continuous or categorical. As the
role of data mining in business, science, medicine, and other fields has grown, so has the
need for techniques that can handle heterogeneous attributes.
• Data Ownership and Distribution: Sometimes, the data needed for an analysis is not
stored in one location or owned by one organization. Instead, the data is geographically
distributed among resources belonging to multiple entities. This requires the development
of distributed data mining techniques. The key challenges faced by distributed data
mining algorithms include (1) how to reduce the amount of communication needed to
perform the distributed computation, (2) how to effectively consolidate the data mining
results obtained from multiple sources, and (3) how to address data security issues.
• Non-traditional Analysis: The traditional statistical approach is based on a hypothesize-
and-test paradigm. Current data analysis tasks often require the generation and evaluation
of thousands of hypotheses, and consequently the desire to automate this process has
motivated the development of some data mining techniques; this is a non-traditional approach.

Data Mining Tasks:


Data mining tasks are generally divided into two major categories:
• Predictive tasks. The objective of these tasks is to predict the value of a particular attribute
based on the values of other attributes. The attribute to be predicted is commonly known
as the target or dependent variable, while the attributes used for making the prediction are
known as the explanatory or independent variables.
• Descriptive tasks. Here, the objective is to derive patterns (correlations, trends, clusters,
trajectories, and anomalies) that summarize the underlying relationships in data.
Descriptive data mining tasks are often exploratory in nature and frequently require
postprocessing techniques to validate and explain the results.

Figure 1.3 illustrates four of the core data mining tasks


Predictive modeling refers to the task of building a model for the target variable as a function
of the explanatory variables.
There are two types of predictive modeling tasks:
▪ classification, which is used for discrete target variables, and
▪ regression, which is used for continuous target variables.
The goal of both tasks is to learn a model that minimizes the error between the predicted and true
values of the target variable.
For example, predicting whether a Web user will make a purchase at an online bookstore is a
classification task because the target variable is binary-valued.
On the other hand, forecasting the future price of a stock is a regression task because price is a
continuous-valued attribute.


• Association analysis: is used to discover patterns that describe strongly associated


features in the data. The discovered patterns are typically represented in the form of
implication rules or feature subsets. Because of the exponential size of its search space, the
goal of association analysis is to extract the most interesting patterns in an efficient
manner.
Useful applications of association analysis include finding groups of genes that have related
functionality, identifying Web pages that are accessed together, or understanding the
relationships between different elements of Earth's climate system.
Example (Market Basket Analysis). The transactions shown in Table 1.1 illustrate point-of-sale
data collected at the checkout counters of a grocery store. Association analysis can be applied to
find items that are frequently bought together by customers. For example, we may discover the
rule {Diapers} → {Milk}, which suggests that customers who buy diapers also tend to buy milk.
This type of rule can be used to identify potential cross-selling opportunities among related items.
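A toy sketch of evaluating such a rule from transaction data, using brute-force support and confidence on hypothetical transactions:

    # Hypothetical point-of-sale transactions (each is a set of purchased items)
    transactions = [
        {"Bread", "Butter", "Diapers", "Milk"},
        {"Coffee", "Sugar", "Cookies", "Salmon"},
        {"Bread", "Butter", "Coffee", "Diapers", "Milk", "Eggs"},
        {"Bread", "Butter", "Salmon", "Chicken"},
        {"Diapers", "Milk", "Eggs"},
    ]

    def support(itemset):
        # fraction of transactions that contain every item in the itemset
        return sum(itemset <= t for t in transactions) / len(transactions)

    # Evaluate the rule {Diapers} -> {Milk}
    antecedent, consequent = {"Diapers"}, {"Milk"}
    sup = support(antecedent | consequent)
    conf = sup / support(antecedent)
    print(f"support = {sup:.2f}, confidence = {conf:.2f}")   # support = 0.60, confidence = 1.00

Real association analysis algorithms (e.g., Apriori) avoid enumerating all itemsets by pruning the search space, but the support and confidence measures are the same.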

• Cluster analysis: seeks to find groups of closely related observations so that


observations that belong to the same cluster are more similar to each other than
observations that belong to other clusters.
Example (Document Clustering). The collection of news articles shown in Table 1.2 can be
grouped based on their respective topics. Each article is represented as a set of word-frequency
pairs (w, c), where w is a word and c is the number of times the word appears in the article. There
are two natural clusters in the data set. The first cluster consists of the first four articles, which
correspond to news about the economy, while the second cluster contains the last four articles,
which correspond to news about health care. A good clustering algorithm should be able to identify
these two clusters based on the similarity between words that appear in the articles.

• Anomaly detection: is the task of identifying observations whose characteristics are


significantly different from the rest of the data. Such observations are known as anomalies
or outliers. The goal of an anomaly detection algorithm is to discover the real anomalies
and avoid falsely labeling normal objects as anomalous. In other words, a good anomaly
detector must have a high detection rate and a low false alarm rate. Applications of anomaly
detection include the detection of fraud, network intrusions, unusual patterns of disease,
and ecosystem disturbances.
Example (Credit Card fraud Detection). A credit card company records the transactions made
by every credit card holder, along with personal information such as credit limit, age, annual
income, and address. Since the number of fraudulent cases is relatively small compared to the
number of legitimate transactions, anomaly detection techniques can be applied to build a profile
of legitimate transactions for the users. When a new transaction arrives, it is compared against the
profile of the user. If the characteristics of the transaction are very different from the previously
created profile, then the transaction is flagged as potentially fraudulent.
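A very rough sketch of this profile idea, using only the transaction amount and a z-score test (hypothetical numbers; real systems combine many more attributes):

    import statistics

    # Past legitimate transaction amounts for one card holder (the "profile")
    history = [23.50, 41.20, 18.75, 65.00, 30.10, 55.40, 27.90, 44.30]
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)

    def is_suspicious(amount, threshold=3.0):
        # flag a new transaction whose amount is far from the user's profile
        z = abs(amount - mean) / stdev
        return z > threshold

    print(is_suspicious(52.00))     # False: consistent with the profile
    print(is_suspicious(1450.00))   # True: flagged as potentially fraudulent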

What is Data
• Collection of data objects and their attributes
• An attribute is a property or characteristic of an object


o Examples: eye color of a person, temperature, etc.


o Attribute is also known as variable, field, characteristic, or feature
• A collection of attributes describe an object
o Object is also known as record, point, case, sample, entity, or instance

Types of Data:
• A data set can often be viewed as a collection of data objects.
• Other names for a data object are record, point, vector, pattern, event, case, sample,
observation, or entity.
• Data objects are described by a number of attributes that capture the basic characteristics
of an object, such as the mass of a physical object or the time at which an event occurred. Other
names for an attribute are variable, characteristic field, feature, or dimension

Data can be broadly classified into 2 types:


• Qualitative
• Quantitative (Numeric)

Qualitative Data: Arise when the observations fall into separate, distinct categories.
Example: colors of eyes: brown, black, hazel, green


Exam results: Pass, Fail


Such data are inherently “discrete” in that there are a finite number of possible categories into which
each observation may fall.

Quantitative Data (Numeric Data): Arise when observations are counts or measurements.
The data are said to be “discrete” if the measurements are integers.
Example: number of people in the house.
The data are said to be “continuous” if the measurements can take any value, usually within some
range.
Example: weight

Attributes and Measurement:


An attribute is a property or characteristic of an object that may vary; either from one object to
another or from one time to another.
Measurement scale is a rule (function) that associates a numerical or symbolic value with an
attribute of an object.
Example: We count the number of chairs in a room to see if there will be enough to seat all the people
who are coming to a meeting.

Type of an Attribute:
The properties of an attribute need not be the same as the properties of the values used to measure
it. In other words, the values used to represent an attribute may have properties that are not
properties of the attribute itself, and vice versa.

The Different Types of Attributes:


A useful (and simple) way to specify the type of an attribute is to identify the properties of numbers
that correspond to underlying properties of the attribute. For example, an attribute such as length
has many of the properties of numbers.
The following properties (operations) of numbers are typically used to describe attributes.
1. Distinctness: = and ≠
2. Order: <, >, ≤ and ≥
3. Addition: + and -


4. Multiplication: × and /
Given these properties, we can define four types of attributes: nominal, ordinal, interval, and ratio.

Nominal Attribute
Nominal means “relating to names.” The values of a nominal attribute are symbols or names of
things. Each value represents some kind of category, code, or state, and so nominal attributes are
also referred to as categorical. The values do not have any meaningful order. In computer science,
the values are also known as enumerations.

Ordinal attribute
An ordinal attribute is an attribute with possible values that have a meaningful order or ranking
among them, but the magnitude between successive values is not known.

Interval-Scaled Attributes
Interval-scaled attributes are measured on a scale of equal-size units. The values of interval-
scaled attributes have order and can be positive, 0, or negative. Thus, in addition to providing a
ranking of values, such attributes allow us to compare and quantify the difference between values.

Ratio-Scaled Attributes
A ratio-scaled attribute is a numeric attribute with an inherent zero-point. That is, if a
measurement is ratio-scaled, we can speak of a value as being a multiple (or ratio) of another value.
In addition, the values are ordered, and we can also compute the difference between values, as
well as the mean, median, and mode.


Describing Attribute by the number of values


An independent way of distinguishing between attributes is by the number of values they can
take
Discrete Attribute:
• Has only a finite or countably infinite set of values
Examples: zip codes, counts, or the set of words in a collection of documents
• Often represented as integer variables.
• Note: binary attributes are a special case of discrete attributes
Continuous Attribute:
• Has real numbers as attribute values
Examples: temperature, height, or weight.


• Practically, real values can only be measured and represented using a finite number of
digits.
• Continuous attributes are typically represented as floating point variables.

Types of Data Sets:


General Characteristics of Data Sets:
Three characteristics that apply to many data sets and have a significant impact on the data mining
techniques that are used: dimensionality, sparsity, and resolution.
Dimensionality: Refers to how many attributes a data set has.
Example: Health care data is notorious for having a vast number of variables (blood pressure,
weight, etc.)
Sparsity: A data set is sparse when a relatively high percentage of its cells do not contain actual data
(most attribute values are zero or absent).
Example: a new variable dimensioned by MONTH for which you don’t have data for past months.
Resolution: Frequently possible to obtain data at different levels of resolution, and often the
properties of the data are different at different resolutions.
Example: surface of the Earth seems very uneven at a resolution of a few meters, but is
relatively smooth at a resolution of tens of kilometers.
The patterns in the data also depend on the level of resolution. If the resolution is too fine, a pattern
may not be visible or may be buried in noise; if the resolution is too coarse, the pattern may
disappear.

Types of Data sets:


Record Data:
• Collection of records, each of which consists of a fixed set of data fields (attributes).
• For the most basic form of record data, there is no explicit relationship among records, and
every record (object) has the same set of attributes.
• Record data is usually stored either in flat files or in relational databases
Example:
Transaction Data – Each record (transaction) involves a set of items.
Data Matrix – Collection of data objects with the same fixed set of numeric attributes.


Document Data – Term matrix, where each term is a component (attribute) of the vector.

Graph-Based Data:
A graph can sometimes be a convenient and powerful representation for data. We consider two
specific cases: (1) the graph captures relationships among data objects and (2) the data objects
themselves are represented as graphs

Data with Relationships among Objects The relationships among objects frequently convey
important information. In such cases, the data is often represented as a graph. In particular, the
data objects are mapped to nodes of the graph, while the relationships among objects are captured
by the links between objects and link properties, such as direction and weight. Consider Web pages
on the World Wide Web, which contain both text and links to other pages.


Data with Objects That Are Graphs If objects have structure, that is, the objects contain
subobjects that have relationships, then such objects are frequently represented as graphs. For
example, the structure of chemical compounds can be represented by a graph, where the nodes
are atoms and the links between nodes are chemical bonds.

Ordered Data
For some types of data, the attributes have relationships that involve order in time or space.

Different types of ordered data are:


Sequential Data:
Also referred to as temporal data, can be thought of as an extension of record
data, where each record has a time associated with it.
Example: A retail transaction data set that also stores the time at which the transaction took place.

Sequence Data:
Sequence data consists of a data set that is a sequence of individual entities, such as a sequence of
words or letters. It is quite similar to sequential data, except that there are no time stamps; instead,
there are positions in an ordered sequence.


Example: the genetic information of plants and animals can be represented in the form of
sequences of nucleotides that are known as genes.

Time Series Data:


Time series data is a special type of sequential data in which each record is a time series, i.e., a
series of measurements taken over time.
Example: A financial data set might contain objects that are time series of the daily prices of
various stocks.
Example: consider a time series of the average monthly temperature for a city during the years
1982 to 1994

Spatial Data:
Some objects have spatial attributes, such as positions or areas, as well as other types of attributes.
Example: Weather data (precipitation, temperature, pressure) that is collected for a variety of
geographical locations.

Data Quality:
• Data mining applications are often applied to data that was collected for another purpose,
or for future, but unspecified, applications.
• For that reason, data mining cannot usually take advantage of the significant benefits of
"addressing quality issues at the source."
• Data mining focuses on
(1) the detection and correction of data quality problems (called data cleaning.)
(2) the use of algorithms that can tolerate poor data quality.

Measurement and Data Collection Issues:


It is unrealistic to expect that data will be perfect. There may be problems due to human error,
limitations of measuring devices, or flaws in the data collection process. Values or even entire data
objects may be missing. In other cases, there may be spurious or duplicate objects.


Measurement and Data Collection Errors:


The term measurement error refers to any problem resulting from the measurement process. A
common problem is that the value recorded differs from the true value to some extent. For
continuous attributes, the numerical difference of the measured and true value is called the error.
The term data collection error refers to errors such as omitting data objects or attribute values, or
inappropriately including a data object. For example, a study of animals of a certain species might
include animals of a related species that are similar in appearance to the species of interest. Both
measurement errors and data collection errors can be either systematic or random.

Noise and Artifacts:


Noise is the random component of a measurement error. It may involve the distortion of a value
or the addition of spurious objects.

Data errors may be the result of a more deterministic phenomenon, such as a streak in the same
place on a set of photographs. Such deterministic distortions of the data are often referred to as
artifacts.

Precision, Bias, and Accuracy:


Precision: The closeness of repeated measurements (of the same quantity) to one another
Bias: A systematic variation of measurements from the quantity being measured
Accuracy: The closeness of measurements to the true value of the quantity being measured

Precision is often measured by the standard deviation of a set of values, while bias is measured by
taking the difference between the mean of the set of values and the known value of the quantity
being measured. It is common to use the more general term, accuracy, to refer to the degree of
measurement error in data.
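For example (a small numeric sketch, with made-up measurements of a standard weight whose true mass is 1.000 g):

    import statistics

    true_value = 1.000                                   # known mass, in grams
    measurements = [1.015, 0.990, 1.013, 1.001, 0.986]   # hypothetical repeated measurements

    precision = statistics.stdev(measurements)           # closeness of the measurements to one another
    bias = statistics.mean(measurements) - true_value    # systematic deviation from the true value
    print(round(precision, 4), round(bias, 4))           # roughly 0.013 and 0.001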


Outliers:
Outliers are either (1) data objects that, in some sense, have characteristics that are different from
most of the other data objects in the data set, or (2) values of an attribute that are unusual with
respect to the typical values for that attribute. Alternatively, we can speak of anomalous objects or
values. It is important to distinguish between the notions of noise and outliers. Outliers can be
legitimate data objects or values. Thus, unlike noise, outliers may sometimes be of interest. In
fraud and network intrusion detection, for example, the goal is to find unusual objects or events
from among a large number of normal ones.

Missing Values:
It is not unusual for an object to be missing one or more attribute values. In some cases, the
information was not collected; e.g., some people decline to give their age or weight. In other cases,
some attributes are not applicable to all objects; e.g., often, forms have conditional parts that are
filled out only when a person answers a previous question in a certain way, but for simplicity, all
fields are stored. Approaches for handling missing values include:
• Eliminate Data Objects
• Estimate Missing Values
• Ignore the Missing Value During Analysis
• Replace with all possible values (weighted by their probabilities)

Eliminate Data Objects or Attributes:


A simple and effective strategy is to eliminate objects with missing values. However, even a
partially specified data object contains some information, and if many objects have missing values,
then a reliable analysis can be difficult or impossible.

Estimate Missing Values:


Sometimes missing data can be reliably estimated. For example, consider a time series that
changes in a reasonably smooth fashion, but has a few, widely scattered missing values. In such
cases, the missing values can be estimated (interpolated) by using the remaining values.
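A sketch of such interpolation, where None marks a missing measurement and each isolated gap is filled with the average of its neighbors (a simple linear interpolation; the series is hypothetical):

    def interpolate(series):
        # fill isolated missing values (None) with the mean of the two neighboring values
        filled = list(series)
        for i, v in enumerate(filled):
            if v is None and 0 < i < len(filled) - 1:
                left, right = filled[i - 1], filled[i + 1]
                if left is not None and right is not None:
                    filled[i] = (left + right) / 2.0
        return filled

    temperatures = [21.0, 21.4, None, 22.2, 22.6, None, 23.4]
    print(interpolate(temperatures))   # [21.0, 21.4, 21.8, 22.2, 22.6, 23.0, 23.4]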


Ignore the Missing Value During Analysis:


Many data mining approaches can be modified to ignore missing values. For example, suppose that
objects are being clustered and the similarity between pairs of data objects needs to be calculated.
If one or both objects of a pair have missing values for some attributes, then the similarity can be
calculated by using only the attributes that do not have missing values.

Inconsistent Values:
Data can contain inconsistent values. Consider an address field, where both a zip code and a city are
listed, but the specified zip code area is not contained in that city. It may be that the individual
entering this information transposed two digits, or perhaps a digit was misread when the
information was scanned from a handwritten form. Some types of inconsistencies are easy to
detect. For instance, a person's height should not be negative. In other cases, it can be necessary
to consult an external source of information. For example, when an insurance company processes
claims for reimbursement, it checks the names and addresses on the reimbursement forms against
a database of its customers. Once an inconsistency has been detected, it is sometimes possible to
correct the data. A product code may have "check" digits, or it may be possible to double-check a
product code against a list of known product codes, and then correct the code if it is incorrect, but
close to a known code. The correction of an inconsistency requires additional or redundant
information.

Duplicate Data:
A data set may include data objects that are duplicates, or almost duplicates, of one another. To
detect and eliminate such duplicates, two main issues must be addressed. First, if there are two
objects that actually represent a single object, then the values of corresponding attributes may
differ, and these inconsistent values must be resolved. Second, care needs to be taken to avoid
accidentally combining data objects that are similar, but not duplicates, such as two distinct people
with identical names. The term deduplication is often used to refer to the process of dealing with
these issues. In some cases, two or more objects are identical with respect to the attributes
measured by the database, but they still represent different objects. Here, the duplicates are
legitimate, but may still cause problems for some algorithms if the possibility of identical objects
is not specifically accounted for in their design.


Data Preprocessing:
• Data preprocessing is a data mining technique that involves transforming raw data into an
understandable format. Real-world data is often incomplete, inconsistent, lacking in certain
behaviors or trends, and likely to contain many errors.
• Data preprocessing is a method of resolving such issues.
• Some of the most important approaches for data preprocessing are:
• Aggregation
• Sampling
• Dimensionality reduction
• Feature subset selection
• Feature creation
• Discretization and binarization
• Variable transformation

These approaches fall into two categories: selecting data objects and attributes for the analysis or
creating/changing the attributes.

Aggregation:
• Aggregation is the combining of two or more objects into a single object. Consider a data set
consisting of transactions (data objects) recording the daily sales of products in various store
locations (Minneapolis, Chicago, Paris, ...) for different days over the course of a year.
• There are several motivations for aggregation:
o First, the smaller data sets resulting from data reduction require less memory and
processing time, and hence, aggregation may permit the use of more expensive data
mining algorithms.
o Second, aggregation can act as a change of scope or scale by providing a high-level
view of the data instead of a low-level view.


o Finally, the behavior of groups of objects or attributes is often more stable


than that of individual objects or attributes.
Disadvantage of aggregation: The potential loss of interesting details.
In the store example aggregating over months loses information about which day of the week has
the highest sales.
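A sketch of the daily-to-monthly aggregation described above, on hypothetical rows (in practice this would be a SQL GROUP BY or an equivalent operation):

    from collections import defaultdict

    # Hypothetical daily sales rows: (store, date "YYYY-MM-DD", amount)
    daily = [
        ("Minneapolis", "2024-01-03", 120.0),
        ("Minneapolis", "2024-01-17", 250.0),
        ("Chicago",     "2024-01-09", 310.0),
        ("Chicago",     "2024-02-02", 180.0),
    ]

    monthly = defaultdict(float)
    for store, date, amount in daily:
        month = date[:7]                   # keep only "YYYY-MM"
        monthly[(store, month)] += amount  # sum the daily figures within each month

    print(dict(monthly))
    # {('Minneapolis', '2024-01'): 370.0, ('Chicago', '2024-01'): 310.0, ('Chicago', '2024-02'): 180.0}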

Sampling:
• Sampling is a commonly used approach for selecting a subset of the data objects to be
analyzed. In statistics, it has long been used for both the preliminary investigation of the
data and the final data analysis.
• The key principle for effective sampling is the following: Using a sample will work almost
as well as using the entire data set if the sample is representative.
• A sample is representative if it has approximately the same property (of interest) as the
original set of data.
Sampling Approaches:
There are two variations on random sampling (and other sampling techniques as well):
(1) sampling without replacement-as each item is selected, it is removed from the set of
all objects that together constitute the population, and
(2) sampling with replacement-objects are not removed from the population as they are
selected for the sample.
In sampling with replacement, the same object can be picked more than once. The samples
produced by the two methods are not much different when samples are relatively small compared
to the data set size, but sampling with replacement is simpler to analyze since the probability of
selecting any object remains constant during the sampling process.

Stratified sampling: which starts with prespecified groups of objects, is such an approach. In the
simplest version, equal numbers of objects are drawn from each group even though the groups are
of different sizes. In another variation, the number of objects drawn from each group is
proportional to the size of that group.
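The sampling variants above can be sketched with Python's standard library (the population and group sizes are hypothetical; random.sample draws without replacement, random.choices draws with replacement):

    import random

    data = list(range(100))      # hypothetical population of 100 object IDs
    random.seed(42)              # for a reproducible illustration

    without_replacement = random.sample(data, 10)   # each object can be picked at most once
    with_replacement = random.choices(data, k=10)   # the same object can be picked more than once

    # Stratified sampling: draw an equal number of objects from each prespecified group
    groups = {"A": list(range(0, 70)), "B": list(range(70, 100))}
    stratified = [x for g in groups.values() for x in random.sample(g, 5)]

    print(without_replacement)
    print(with_replacement)
    print(stratified)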


Progressive Sampling: The proper sample size can be difficult to determine, so adaptive or
progressive sampling schemes are sometimes used. These approaches start with a small sample,
and then increase the sample size until a sample of sufficient size has been obtained.

Dimensionality Reduction:
• Data sets can have a large number of features. There are a variety of benefits to
dimensionality reduction.
• A key benefit is that many data mining algorithms work better if the dimensionality (the
number of attributes in the data) is lower.
• Another benefit is that a reduction of dimensionality can lead to a more understandable
model because the model may involve fewer attributes.
• Also, dimensionality reduction may allow the data to be more easily visualized. Even if
dimensionality reduction doesn't reduce the data to two or three dimensions, data is often
visualized by looking at pairs or triplets of attributes, and the number of such combinations
is greatly reduced.
• The amount of time and memory required by the data mining algorithm is reduced with a
reduction in dimensionality.
• The term dimensionality reduction is often reserved for those techniques that reduce the
dimensionality of a data set by creating new attributes that are a combination of the old
attributes. The reduction of dimensionality by selecting new attributes that are a subset of
the old is known as feature subset selection or feature selection.

The Curse of Dimensionality:


The curse of dimensionality refers to the phenomenon that many types of data analysis become
significantly harder as the dimensionality of the data increases. Specifically, as dimensionality
increases, the data becomes increasingly sparse in the space that it occupies.


Feature Subset Selection


• Another way to reduce the dimensionality is to use only a subset of the features.
• Although it might seem that this approach would lose information, that is not the case if
redundant and irrelevant features are present. Redundant features duplicate much or all of
the information contained in one or more other attributes.
• For example, the purchase price of a product and the amount of sales tax paid contain much
of the same information. Irrelevant features contain almost no useful information for the
data mining task at hand. For instance, students' ID numbers are irrelevant to the task of
predicting students' grade point averages.
• Redundant and irrelevant features can reduce classification accuracy and the quality of the
clusters that are found.
There are three standard approaches to feature selection: embedded, filter, and wrapper.
Embedded approaches: Feature selection occurs naturally as part of the data mining algorithm.
Algorithm itself decides which attributes to use and which to ignore. These algorithms are used
for building decision tree classifier.

Filter approaches: Features are selected before the data mining algorithm is run, using some
approach that is independent of the data mining task.
For example, we might select sets of attributes whose pairwise correlation is as low as possible.

Wrapper approaches: These methods use the target data mining algorithm as a black box to find
the best subset of attributes, in a way similar to an ideal exhaustive search over subsets, but
typically without enumerating all possible subsets.
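A tiny sketch of a filter-style criterion along the lines just mentioned: drop one feature from any pair whose pairwise correlation exceeds a threshold (hypothetical data; statistics.correlation requires Python 3.10+):

    from statistics import correlation   # Python 3.10+

    # Hypothetical features; "tax" is almost perfectly correlated with "price"
    features = {
        "price":  [100.0, 250.0, 300.0, 80.0, 520.0],
        "tax":    [8.0,   20.0,  24.0,  6.4,  41.6],
        "weight": [1.2,   0.4,   2.5,   0.3,  1.9],
    }

    threshold = 0.95
    selected = []
    for name, values in features.items():
        # keep a feature only if it is not highly correlated with an already selected one
        if all(abs(correlation(values, features[kept])) < threshold for kept in selected):
            selected.append(name)

    print(selected)   # ['price', 'weight'] -- "tax" is filtered out as redundant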

An Architecture for Feature Subset Selection:


• The feature selection process is viewed as consisting of four parts: a measure for evaluating
a subset, a search strategy that controls the generation of a new subset of features, a
stopping criterion, and a validation procedure. Filter methods and wrapper methods differ
only in the way in which they evaluate a subset of features.
• An integral part of the search is an evaluation step to judge how the current subset of
features compares to others that have been considered. This requires an evaluation
measure that attempts to determine the goodness of a subset of attributes with respect to
a particular data mining task, such as classification or clustering.
• For the filter approach, such measures attempt to predict how well the actual data mining
algorithm will perform on a given set of attributes.
• For the wrapper approach, where evaluation consists of actually running the target data
mining application, the subset evaluation function is simply the criterion normally used to
measure the result of the data mining.

Because the number of subsets can be enormous and it is impractical to examine them all, some sort of
stopping criterion is necessary.
This strategy is usually based on one or more conditions involving the following:
• The number of iterations, whether the value of the subset evaluation measure is optimal or
exceeds a certain threshold, whether a subset of a certain size has been obtained, whether
simultaneous size and evaluation criteria have been achieved, and whether any
improvement can be achieved by the options available to the search strategy.
• Finally, once a subset of features has been selected, the results of the target data mining
algorithm on the selected subset should be validated. A straightforward evaluation
approach is to run the algorithm with the full set of features and compare the full results to
results obtained using the subset of features.

Another validation approach is to use a number of different feature selection algorithms to obtain
subsets of features and then compare the results of running the data mining algorithm on each
subset.


Feature Weighting:
• Feature weighting is an alternative to keeping or eliminating features. More important
features are assigned a higher weight, while less important features are given a lower
weight. These weights are sometimes assigned based on domain knowledge about the
relative importance of features.
• Alternatively, they may be determined automatically. For example, some classification
schemes, such as support vector machines produce classification models in which each
feature is given a weight.
• Features with larger weights play a more important role in the model. The normalization
of objects that takes place when computing the cosine similarity can also be regarded as a
type of feature weighting.

Feature Creation:
• It is frequently possible to create, from the original attributes, a new set of attributes that
captures the important information in a data set much more effectively.
• Three related methodologies for creating new attributes are described next:
o Feature extraction,
o Mapping the data to a new space, and
o Feature construction.

Feature Extraction:
The creation of a new set of features from the original raw data is known as feature extraction.

Mapping the Data to a New Space:


A totally different view of the data can reveal important and interesting features.

Feature Construction:
Sometimes the features in the original data sets have the necessary information, but it is not in a
form suitable for the data mining algorithm. In this situation, one or more new features
constructed out of the original features can be more useful than the original features.
Example 2.11- (Density). Consider a data set consisting of information about historical artifacts,
which, along with other information, contains the volume and mass of each artifact. In this case, a
density feature constructed from the mass and volume features, i.e., density=mass/volume, would
most directly yield an accurate classification.

Although there have been some attempts to automatically perform feature construction by
exploring simple mathematical combinations of existing attributes, the most common approach is
to construct features using domain expertise.

Discretization and Binarization:


Some data mining algorithms, especially certain classification algorithms, require that the data be
in the form of categorical attributes. Algorithms that find association patterns require that the data
be in the form of binary attributes. Thus, it is often necessary to transform a continuous attribute
into a categorical attribute (discretization), and both continuous and discrete attributes may need
to be transformed into one or more binary attributes (binarization).

Binarization
A simple technique to binarize a categorical attribute is the following:
• If there are m categorical values, then uniquely assign each original value to an integer in
the interval [0,m - 1].
• If the attribute is ordinal, then order must be maintained by the assignment. Next, convert
each of these m integers to a binary number. Since n = ⌈log2(m)⌉ binary digits are required to
represent these integers, represent these binary numbers using n binary attributes.
• To illustrate, a categorical variable with 5 values {awful, poor, OK, good, great} would
require three binary variables x1, x2, x3. The conversion is shown in Table 2.5.
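A sketch of this binarization scheme for the five-value ordinal attribute above (n = ceil(log2(5)) = 3 binary attributes; the value set is the one from the text):

    from math import ceil, log2

    values = ["awful", "poor", "OK", "good", "great"]   # ordinal, so order must be preserved
    n = ceil(log2(len(values)))                         # number of binary attributes needed

    def binarize(v):
        i = values.index(v)                             # map the value to an integer in [0, m-1]
        return [int(b) for b in format(i, f"0{n}b")]    # then to n binary digits

    for v in values:
        print(v, binarize(v))
    # awful [0,0,0], poor [0,0,1], OK [0,1,0], good [0,1,1], great [1,0,0]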


Variable Transformation:
• A variable transformation refers to a transformation that is applied to all the values of a
variable. In other words, for each object, the transformation is applied to the value of the
variable for that object.
• For example, if only the magnitude of a variable is important, then the values of the variable
can be transformed by taking the absolute value.
• Two important types of variable transformations:
o simple functional transformations and
o normalization.

Simple Functions:
• For this type of variable transformation, a simple mathematical function is applied to each
value individually. If x is a variable, then examples of such transformations include x^k, log x,
e^x, √x, 1/x, sin x, and |x|. In statistics, variable transformations, especially sqrt, log, and
1/x, are often used to transform data that does not have a Gaussian (normal) distribution
into data that does.
• Variable transformations should be applied with caution since they change the nature of
the data. While this may be what is desired, there can be problems if the nature of the
transformation is not fully appreciated. For instance, the transformation 1/x reduces the
magnitude of values that are 1 or larger, but increases the magnitude of values between 0
and 1.
• To illustrate, the values {1, 2, 3} go to {1, 1/2, 1/3}, while the values {1, 1/2, 1/3} go to {1, 2, 3}.
Thus, for all sets of values, the transformation 1/x reverses the order.


Normalization or Standardization:
• Another common type of variable transformation is the standardization or normalization
of a variable. (In the data mining community the terms are often used interchangeably. In
statistics, however, the term normalization can be confused with the transformations used
for making a variable normal, i.e., Gaussian.)
• The goal of standardization or normalization is to make an entire set of values have a
particular property.
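A minimal sketch of this z-score standardization (the income values are hypothetical):

import statistics

def standardize(values):
    """Subtract the mean and divide by the standard deviation, so the transformed
    values have mean 0 and (sample) standard deviation 1."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]

incomes = [20000, 35000, 50000, 120000]
print(standardize(incomes))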

Measures of Similarity and Dissimilarity:

Definitions
• Informally, the similarity between two objects is a numerical measure of the degree to
which the two objects are alike. Consequently, similarities are higher for pairs of objects
that are more alike. Similarities are usually non-negative and are often between 0 (no
similarity) and 1 (complete similarity).
• The dissimilarity between two objects is a numerical measure of the degree to which the
two objects are different. Dissimilarities are lower for more similar pairs of objects.
Frequently, the term distance is used as a synonym for dissimilarity, although, as we shall
see, distance is often used to refer to a special class of dissimilarities. Dissimilarities
sometimes fall in the interval [0,1], but it is also common for them to range from 0 to ∞.

• The term proximity is used to refer to either similarity or dissimilarity. The proximity
between two objects is a function of the proximities between the corresponding attributes of
the two objects.

Transformations:
• Transformations are often applied to convert a similarity to a dissimilarity, or vice versa,
or to transform a proximity measure to fall within a particular range, such as [0,1].
• Frequently, proximity measures, especially similarities, are defined or transformed to have
values in the interval [0,1]. The motivation for this is to use a scale in which a proximity
value indicates the fraction of similarity (or dissimilarity) between two objects. Such a
transformation is often relatively straightforward.


• For example, if the similarities between objects range from 1 (not at all similar) to 10
(completely similar), we can make them fall within the range [0, 1] by using the
transformation s' = (s - 1) / 9, where s and s' are the original and new similarity
values, respectively.
• In the more general case, the transformation of similarities to the interval [0, 1] is given by
the expression s' = (s - min_s) / (max_s - min_s), where max_s and min_s are the
maximum and minimum similarity values, respectively.
• Likewise, dissimilarity measures with a finite range can be mapped to the interval [0, 1] by
using the formula d' = (d - min_d) / (max_d - min_d), where max_d and min_d are the
maximum and minimum dissimilarity values, respectively.
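A minimal sketch of such a range transformation (the function name is illustrative):

def to_unit_interval(s, min_s, max_s):
    """Map a similarity (or dissimilarity) with known range [min_s, max_s] onto [0, 1]."""
    return (s - min_s) / (max_s - min_s)

# The 1-to-10 similarity scale from the example: 1 -> 0.0, 5.5 -> 0.5, 10 -> 1.0
print(to_unit_interval(5.5, 1, 10))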

Similarity and Dissimilarity between Simple Attributes:


The proximity of objects with a number of attributes is typically defined by combining the
proximities of the individual attributes. Let p and q be the values of a single attribute for two
data objects. Table 2.7 summarizes the standard per-attribute measures by attribute type:
• Nominal: d = 0 if p = q, d = 1 if p ≠ q; s = 1 if p = q, s = 0 if p ≠ q.
• Ordinal: map the values to integers 0 to n - 1, where n is the number of values; then
d = |p - q| / (n - 1) and s = 1 - d.
• Interval or ratio: d = |p - q|; s can be taken as -d, 1/(1 + d), or 1 - (d - min_d)/(max_d - min_d).

Dissimilarities between Data Objects:


Distances
We first present some examples, and then offer a more formal description of distances in terms of
the properties common to all distances. The Euclidean distance, d, between two points, x and y, in
one-, two-, three-, or higher dimensional space, is given by the following familiar formula:


d(x, y) = sqrt( Σ_{k=1..n} (x_k - y_k)^2 )
where n is the number of dimensions and x_k and y_k are, respectively, the kth attributes
(components) of x and y. We illustrate this formula with Figure 2.15 and Tables 2.8 and 2.9, which
show a set of points, the x and y coordinates of these points, and the distance matrix containing
the pairwise distances of these points.

The Euclidean distance measure is generalized by the Minkowski distance metric, shown in the
equation below:
d(x, y) = ( Σ_{k=1..n} |x_k - y_k|^r )^(1/r)
where r is a parameter.
The following are the three most common examples of Minkowski distances:
• r = 1. City block (Manhattan, taxicab, L1 norm) distance. A common example is the Hamming
distance, which is the number of bits that are different between two binary vectors.
• r = 2. Euclidean distance (L2 norm).
• r = ∞. Supremum (L_max or L∞ norm) distance, which is the maximum difference between
any attribute of the two objects.

The r parameter should not be confused with the number of dimensions (attributes) n. The
Euclidean, Manhattan, and supremum distances are defined for all values of n: 1, 2, 3, ..., and specify
different ways of combining the differences in each dimension (attribute) into an overall distance.
Tables 2.10 and 2.11, respectively, give the proximity matrices for the L1 and L∞ distances using
data from Table 2.8. Notice that all these distance matrices are symmetric; i.e., the ijth entry is the
same as the jith entry. In Table 2.9, for instance, the fourth row of the first column and the fourth
column of the first row both contain the value 5.1.


Distances, such as the Euclidean distance, have some well-known properties.

If d(x, y) is the distance between two points, x and y, then the following properties hold.
1. Positivity: d(x, y) ≥ 0 for all x and y, and d(x, y) = 0 only if x = y.
2. Symmetry: d(x, y) = d(y, x) for all x and y.
3. Triangle Inequality: d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z.

Measures that satisfy all three properties are known as metrics.


Similarities between Data Objects


For similarities, the triangle inequality (or the analogous property) typically does not hold, but
symmetry and positivity typically do. To be explicit, if s(x, y) is the similarity between points x and
y, then the typical properties of similarities are the following:
1. s(x, y) = 1 only if x = y (assuming similarities lie in [0, 1]).
2. s(x, y) = s(y, x) for all x and y (symmetry).

MINKOWSKI DISTANCE EXAMPLE:
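The original worked example is a figure that is not reproduced here; the short sketch below
computes the three Minkowski distances for two illustrative 2-D points (the point values are made
up for the demonstration).

def minkowski(x, y, r):
    """Minkowski distance with parameter r; r = 1 gives L1, r = 2 gives L2."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)

def supremum(x, y):
    """L-infinity (supremum) distance: the largest per-attribute difference."""
    return max(abs(a - b) for a, b in zip(x, y))

p1, p2 = (0, 2), (3, 1)
print(minkowski(p1, p2, 1))   # L1  = |0 - 3| + |2 - 1| = 4.0
print(minkowski(p1, p2, 2))   # L2  = sqrt(9 + 1) ≈ 3.162
print(supremum(p1, p2))       # L∞  = max(3, 1) = 3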


Similarity Measures for Binary Data


• Similarity measures between objects that contain only binary attributes are called
similarity coefficients, and typically have values between 0 and 1. A value of 1 indicates that
the two objects are completely similar, while a value of 0 indicates that the objects are not
at all similar. There are many rationales for why one coefficient is better than another in
specific instances.
• Let x and y be two objects that consist of n binary attributes. The comparison of two such
objects, i.e., two binary vectors, leads to the following four quantities (frequencies):
f00 = the number of attributes where x is 0 and y is 0
f01 = the number of attributes where x is 0 and y is 1
f10 = the number of attributes where x is 1 and y is 0
f11 = the number of attributes where x is 1 and y is 1

Simple Matching Coefficient (SMC):


One commonly used similarity coefficient is the simple matching coefficient (SMC), which is
defined as
SMC = (number of matching attribute values) / (number of attributes) = (f11 + f00) / (f01 + f10 + f11 + f00)
This measure counts both presences and absences equally. Consequently, the SMC could be used
to find students who had answered questions similarly on a test that consisted only of true/false
questions.

Jaccard Coefficient (J):


• Jaccard Coefficient Suppose that x and y are data objects that represent two rows (two
transactions) of a transaction matrix (see Section 2.1.2).
• If each asymmetric binary attribute corresponds to an item in a store, then a 1 indicates
that the item was purchased, while a 0 indicates that the product was not purchased. Since
the number of products not purchased by any customer far outnumbers the number of
products that were purchased, a similarity measure such as SMC would say that all
transactions are very similar.


• As a result, the Jaccard coefficient is frequently used to handle objects consisting of
asymmetric binary attributes. The Jaccard coefficient, which is often symbolized by J, is
given by the following equation:
J = (number of matching presences) / (number of attributes not involved in 0-0 matches) = f11 / (f01 + f10 + f11)
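A small sketch comparing the two coefficients (the binary vectors are illustrative; note how the
many 0-0 matches inflate the SMC but not the Jaccard coefficient):

def binary_counts(x, y):
    """Count the four match/mismatch frequencies for two equal-length 0/1 vectors."""
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    f00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    f10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    f01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    return f00, f01, f10, f11

def smc(x, y):
    f00, f01, f10, f11 = binary_counts(x, y)
    return (f11 + f00) / (f00 + f01 + f10 + f11)

def jaccard(x, y):
    f00, f01, f10, f11 = binary_counts(x, y)
    return f11 / (f01 + f10 + f11)

x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(smc(x, y), jaccard(x, y))   # 0.7 and 0.0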

Cosine Similarity:
• Documents are often represented as vectors, where each attribute represents the
frequency with which a particular term (word) occurs in the document.
• It is more complicated than this, of course, since certain common words are ignored and
various processing techniques are used to account for different forms of the same word,
differing document lengths, and different word frequencies.
• Even though documents have thousands or tens of thousands of attributes (terms), each
document is sparse since it has relatively few non-zero attributes. (The normalizations
used for documents do not create a non-zero entry where there was a zero entry; i.e., they
preserve sparsity.)
• Thus, as with transaction data, similarity should not depend on the number of shared 0
values since any two documents are likely to "not contain" many of the same words, and
therefore, if 0-0 matches are counted, most documents will be highly similar to most other
documents. Therefore, a similarity measure for documents needs to ignore 0-0 matches
like the Jaccard measure, but it must also be able to handle non-binary vectors.
• The cosine similarity, defined next, is one of the most common measures of document
similarity. If x and y are two document vectors, then
cos(x, y) = (x · y) / (||x|| ||y||)
where · denotes the vector dot product and ||x|| is the length (Euclidean norm) of vector x.
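A minimal sketch of cosine similarity for two term-frequency vectors (the document vectors are
illustrative):

import math

def cosine_similarity(x, y):
    """cos(x, y) = dot(x, y) / (||x|| * ||y||)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]   # hypothetical term frequencies, document 1
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]   # hypothetical term frequencies, document 2
print(cosine_similarity(d1, d2))      # ≈ 0.31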

Extended Jaccard Coefficient (Tanimoto Coefficient):


The extended Jaccard coefficient can be used for document data and reduces to the Jaccard
coefficient in the case of binary attributes. The extended Jaccard coefficient is also known as the
Tanimoto coefficient. (However, there is another coefficient that is also known as the Tanimoto
coefficient.) This coefficient, which we shall represent as EJ, is defined by the following equation:
EJ(x, y) = (x · y) / (||x||^2 + ||y||^2 - x · y)
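A sketch of the same computation; the final line checks that, for binary vectors, EJ agrees with the
Jaccard coefficient:

def extended_jaccard(x, y):
    """EJ(x, y) = dot(x, y) / (||x||^2 + ||y||^2 - dot(x, y))."""
    dot = sum(a * b for a, b in zip(x, y))
    sq_x = sum(a * a for a in x)
    sq_y = sum(b * b for b in y)
    return dot / (sq_x + sq_y - dot)

print(extended_jaccard([1, 1, 0, 1], [1, 1, 1, 0]))   # 0.5, the Jaccard value for these binary vectors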

Correlation:
• The correlation between two data objects that have binary or continuous variables is a
measure of the linear relationship between the attributes of the objects. (The calculation of
correlation between attributes, which is more common, can be defined similarly.)
• More precisely, Pearson's correlation coefficient between two data objects, x and y, is
defined by the following equation:
corr(x, y) = covariance(x, y) / (std(x) * std(y)) = s_xy / (s_x * s_y)
where s_xy = (1/(n-1)) Σ_{k=1..n} (x_k - mean(x))(y_k - mean(y)) is the covariance, and s_x and
s_y are the standard deviations of x and y.
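A minimal sketch of the correlation computation (the vectors are illustrative; here y = -x/3, a
perfect negative linear relationship):

import statistics

def correlation(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    return cov / (statistics.stdev(x) * statistics.stdev(y))

x = [-3, 6, 0, 3, -6]
y = [1, -2, 0, -1, 2]
print(correlation(x, y))   # -1.0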


Issues in Proximity Calculation


This section discusses several important issues related to proximity measures:
(1) how to handle the case in which attributes have different scales and/or are correlated,
(2) how to calculate proximity between objects that are composed of different types of
attributes, e.g., quantitative and qualitative, and
(3) how to handle proximity calculation when attributes have different weights, i.e.,
when not all attributes contribute equally to the proximity of objects.

Standardization and Correlation for Distance Measures


• An important issue with distance measures is how to handle the situation when attributes
do not have the same range of values. (This situation is often described by saying that "the
variables have different scales.") Earlier, Euclidean distance was used to measure the
distance between people based on two attributes: age and income. Unless these two
attributes are standardized, the distance between two people will be dominated by income.


• A related issue is how to compute distance when there is correlation between some of the
attributes, perhaps in addition to differences in the ranges of values. A generalization of
Euclidean distance, the Mahalanobis distance, is useful when attributes are correlated, have
different ranges of values (different variances), and the distribution of the data is
approximately Gaussian (normal). Specifically, the Mahalanobis distance between two
objects (vectors) x and y is defined as
mahalanobis(x, y) = (x - y) Σ^(-1) (x - y)^T
where Σ^(-1) is the inverse of the covariance matrix of the data.
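A rough numpy sketch of this computation (the sample data is made up; note that some libraries
report the square root of this quantity instead):

import numpy as np

def mahalanobis(x, y, data):
    """(x - y) S^(-1) (x - y)^T, with S the covariance matrix estimated from data."""
    cov = np.cov(data, rowvar=False)          # attributes are the columns of data
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(diff @ np.linalg.inv(cov) @ diff)

# Hypothetical 2-D data set with correlated attributes.
data = np.array([[2.0, 3.0], [4.0, 7.0], [5.0, 4.0], [7.0, 9.0], [9.0, 8.0]])
print(mahalanobis([2.0, 3.0], [9.0, 8.0], data))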

Combining Similarities for heterogeneous Attributes


• The previous definitions of similarity were based on approaches that assumed all the
attributes were of the same type
• A general approach is needed when the attributes are of different types. One
straightforward approach is to compute the similarity between each attribute separately
using Table 2.7, and then combine these similarities using a method that results in a
similarity between 0 and 1.
• Typically, the overall similarity is defined as the average of all the individual attribute
similarities. Unfortunately, this approach does not work well if some of the attributes are
asymmetric attributes.
For example, if all the attributes are asymmetric binary attributes, then the similarity
measure suggested previously reduces to the simple matching coefficient, a measure that
is not appropriate for asymmetric binary attributes. The easiest way to fix this problem is
to omit asymmetric attributes from the similarity calculation when their values are 0 for
both of the objects whose similarity is being computed. A similar approach also works well
for handling missing values.
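A sketch of one way to implement this (the function and flag names are illustrative; each
per-attribute similarity here is the simple 0/1 match, as for nominal attributes):

def combined_similarity(x, y, asymmetric):
    """Average the per-attribute similarities, but skip attribute k when it is
    asymmetric and both objects have the value 0 (a 0-0 match carries no information)."""
    total, count = 0.0, 0
    for xk, yk, asym in zip(x, y, asymmetric):
        if asym and xk == 0 and yk == 0:
            continue                           # leave this attribute out entirely
        total += 1.0 if xk == yk else 0.0      # per-attribute similarity s_k
        count += 1
    return total / count if count else 0.0

x = [1, 0, 0, 1, 0]
y = [1, 0, 1, 0, 0]
asymmetric = [True, True, True, True, False]   # the last attribute is symmetric
print(combined_similarity(x, y, asymmetric))   # 0.5: four attributes counted, two of them match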


Using Weights:
• In much of the previous discussion, all attributes were treated equally when computing
proximity. This is not desirable when some attributes are more important to the definition
of proximity than others. To address these situations, the proximity formulas can be modified
by weighting the contribution of each attribute with weights w_k (typically chosen to sum to 1);
for example, the Minkowski distance generalizes to d(x, y) = ( Σ_{k=1..n} w_k |x_k - y_k|^r )^(1/r).

Problems:
1)Classify the following attributes as binary, discrete, or continuous. Also classify them as
qualitative (nominal or ordinal) or quantitative (interval or ratio). Some cases may have more than
one interpretation, so briefly indicate your reasoning if you think there may be some ambiguity.
Example: Age in years. Answer: Discrete, quantitative, ratio
(a) Time in terms of AM or PM. Answer: Binary, qualitative, ordinal
(b) Brightness as measured by a light meter. Answer: Continuous, quantitative, ratio
(c) Brightness as measured by people’s judgments. Answer: Discrete, qualitative, ordinal


(d) Angles as measured in degrees between 0◦ and 360◦ . Answer: Continuous, quantitative, ratio
(e) Bronze, Silver, and Gold medals as awarded at the Olympics. Answer: Discrete, qualitative,
ordinal
(f) Height above sea level. Answer: Continuous, quantitative, interval/ratio (depends on whether
sea level is regarded as an arbitrary origin)
(g) Number of patients in a hospital. Answer: Discrete, quantitative, ratio
(h) ISBN numbers for books. (Look up the format on the Web.) Answer: Discrete, qualitative,
nominal (ISBN numbers do have order information, though)
(i) Ability to pass light in terms of the following values: opaque, translucent, transparent. Answer:
Discrete, qualitative, ordinal
(j) Military rank. Answer: Discrete, qualitative, ordinal
(k) Distance from the center of campus. Answer: Continuous, quantitative, interval/ratio
(depends)
(l) Density of a substance in grams per cubic centimeter. Answer: Continuous, quantitative, ratio
(m) Coat check number. (When you attend an event, you can often give your coat to someone who,
in turn, gives you a number that you can use to claim your coat when you leave.) Answer:
Discrete, qualitative, nominal

2) Compute the Hamming distance and the Jaccard similarity between the following two binary
vectors
x = 0101010001
y = 0100011000
Solution: Hamming distance = number of different bits = 3
Jaccard Similarity = number of 1-1 matches / (number of bits - number of 0-0 matches) = 2 / 5 = 0.4
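A quick sketch that checks both values:

x = "0101010001"
y = "0100011000"

# Hamming distance: the number of bit positions that differ.
hamming = sum(1 for a, b in zip(x, y) if a != b)

# Jaccard similarity: 1-1 matches divided by the positions that are not 0-0 matches.
f11 = sum(1 for a, b in zip(x, y) if a == "1" and b == "1")
f00 = sum(1 for a, b in zip(x, y) if a == "0" and b == "0")
jaccard = f11 / (len(x) - f00)

print(hamming, jaccard)   # 3 and 0.4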

3)For the following vectors, x and y, calculate the indicated similarity or distance measures.
(a) x = (1, 1, 1, 1), y = (2, 2, 2, 2): cosine, correlation, Euclidean
(b) x = (0, 1, 0, 1), y = (1, 0, 1, 0): cosine, correlation, Euclidean, Jaccard
(c) x = (0, -1, 0, 1), y = (1, 0, -1, 0): cosine, correlation, Euclidean
(d) x = (1, 1, 0, 1, 0, 1), y = (1, 1, 1, 0, 0, 1): cosine, correlation, Jaccard
(e) x = (2, -7, 0, 2, 0, -3), y = (-1, 1, -1, 0, 0, -1): cosine, correlation
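These can be checked programmatically; the sketch below collects the measure definitions used in
this module and prints the values for part (b), where the two vectors disagree on every attribute
(so the cosine is 0, the correlation is -1, the Euclidean distance is 2, and the Jaccard coefficient is 0):

import math
import statistics

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def correlation(x, y):
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (statistics.stdev(x) * statistics.stdev(y))

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def jaccard(x, y):
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    f00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    return f11 / (len(x) - f00) if len(x) > f00 else 0.0

x, y = (0, 1, 0, 1), (1, 0, 1, 0)   # part (b)
print(cosine(x, y), correlation(x, y), euclidean(x, y), jaccard(x, y))
# 0.0  -1.0  2.0  0.0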
