MODULE 2
Data warehouse implementation & Data mining
• The base cuboid contains all three dimensions, city, item, and year. It can return the total sales
for any combination of the three dimensions. The apex cuboid, or 0-D cuboid, refers to the case
where the group-by is empty. It contains the total sum of all sales.
• The base cuboid is the least generalized (most specific) of the cuboids. The apex cuboid is the
most generalized (least specific) of the cuboids, and is often denoted as all.
• An SQL query containing no group-by (e.g., “compute the sum of total sales”) is a zero-dimensional operation. An SQL query containing one group-by (e.g., “compute the sum of sales, group-by city”) is a one-dimensional operation. A cube operator on n dimensions is equivalent to a collection of group-by statements, one for each subset of the n dimensions. Therefore, the cube operator is the n-dimensional generalization of the group-by operator. Similar to SQL syntax, the data cube in the example above could be defined as:
define cube sales cube [city, item, year]: sum(sales in dollars)
• For a cube with n dimensions, there are a total of 2^n cuboids, including the base cuboid. A statement such as compute cube sales cube would explicitly instruct the system to compute the sales aggregate cuboids for all eight subsets of the set {city, item, year}, including the empty subset.
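As a rough illustration (not from the text), the sketch below enumerates the 2^n group-by subsets that the cube operator covers, using the three dimensions of the example above:

    # Sketch: enumerate the 2^n cuboids (group-by subsets) of an n-dimensional cube.
    from itertools import combinations

    dimensions = ["city", "item", "year"]          # n = 3 dimensions from the example

    cuboids = []
    for k in range(len(dimensions) + 1):           # subset sizes 0..n
        for subset in combinations(dimensions, k):
            cuboids.append(subset)

    for c in cuboids:
        label = ", ".join(c) if c else "all (apex cuboid)"
        print(label)

    print("total cuboids:", len(cuboids))          # 2^3 = 8, including base and apex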
curse of dimensionality:
• A major challenge related to this precomputation, however, is that the required storage space
may explode if all the cuboids in a data cube are precomputed, especially when the cube has
many dimensions.
• The storage requirements are even more excessive when many of the dimensions have
associated concept hierarchies, each with multiple levels. This problem is referred to as the
curse of dimensionality.
Example: time is usually explored not at only one conceptual level (e.g., year), but rather at multiple conceptual levels such as in the hierarchy “day < month < quarter < year.” For an n-dimensional data cube, the total number of cuboids that can be generated (including the cuboids generated by climbing up the hierarchies along each dimension) is
Total number of cuboids = (L1 + 1) × (L2 + 1) × ... × (Ln + 1)
where Li is the number of levels associated with dimension i. One is added to Li to include the virtual top level, all.
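A minimal sketch of this count, assuming hypothetical hierarchies and level counts Li per dimension; it simply multiplies (Li + 1) over all dimensions, matching the formula above:

    # Sketch: total cuboids = product of (L_i + 1) over all n dimensions,
    # where L_i is the number of hierarchy levels of dimension i (+1 for the virtual level "all").
    from math import prod

    levels = {"time": 4, "item": 3, "location": 4}   # hypothetical, e.g. day < month < quarter < year

    total_cuboids = prod(L + 1 for L in levels.values())
    print(total_cuboids)   # (4+1) * (3+1) * (4+1) = 100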
• The selection of the subset of cuboids or subcubes to materialize should take into account the
queries in the workload, their frequencies, and their accessing costs. In addition, it should consider
workload characteristics, the cost for incremental updates, and the total storage requirements.
• A popular approach is to materialize the set of cuboids on which other frequently referenced cuboids are based. Alternatively, we can compute an iceberg cube, which is a data cube that stores only those cube cells with an aggregate value (e.g., count) that is above some minimum support threshold.
• Another common strategy is to materialize a shell cube. This involves precomputing the
cuboids for only a small number of dimensions (e.g., three to five) of a data cube.
• Once the selected cuboids have been materialized, it is important to take advantage of them during query processing. This involves several issues, such as
o how to determine the relevant cuboid(s) from among the candidate materialized
cuboids,
o how to use available index structures on the materialized cuboids, and
o how to transform the OLAP operations onto the selected cuboid(s).
Finally, during load and refresh, the materialized cuboids should be updated efficiently.
The join indexing method gained popularity from its use in relational database query
processing. Traditional indexing maps the value in a given column to a list of rows having that
value. In contrast, join indexing registers the joinable rows of two relations from a relational
database.
• In data warehouses, a join index relates the values of the dimensions of a star schema to rows in the fact table.
o E.g. fact table: Sales and two dimensions city and product
▪ A join index on city maintains for each distinct city a list of R-IDs of the tuples
recording the Sales in the city
o Join indices can span multiple dimensions
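A toy sketch of the idea, assuming a hypothetical Sales fact table held as a Python list; the join index simply maps each dimension value (here, city) to the R-IDs of the fact-table tuples that join with it:

    # Sketch: a join index maps each dimension value to the R-IDs (row ids) of joinable fact rows.
    from collections import defaultdict

    sales_fact = [                      # hypothetical fact table; R-ID is the list position
        {"city": "Chicago",  "product": "milk",  "amount": 120},
        {"city": "Toronto",  "product": "bread", "amount": 80},
        {"city": "Chicago",  "product": "bread", "amount": 95},
    ]

    join_index_city = defaultdict(list)
    for rid, row in enumerate(sales_fact):
        join_index_city[row["city"]].append(rid)

    print(join_index_city["Chicago"])   # [0, 2] -> rows recording sales in Chicago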
Efficient processing of OLAP queries:
1. Determine which operations should be performed on the available cuboids: This involves transforming any selection, projection, roll-up (group-by), and drill-down operations specified in the query into corresponding SQL and/or OLAP operations.
For example, slicing and dicing a data cube may correspond to selection and/or projection
operations on a materialized cuboid.
2. Determine to which materialized cuboid(s) the relevant operations should be applied:
This involves identifying all of the materialized cuboids that may potentially be used to answer the
query, pruning the set using knowledge of “dominance” relationships among the cuboids,
estimating the costs of using the remaining materialized cuboids, and selecting the cuboid with
the least cost.
▪ For example, if cuboid 3 is smaller than cuboid 4, cuboid 3 should be chosen to process the query. However, if efficient indices are available for cuboid 4, then cuboid 4 may be a better choice.
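The following sketch (with made-up cuboid names and sizes) illustrates the cost-based pruning idea: among the materialized cuboids that contain all the dimensions the query groups on, pick the one with the fewest cells, which is usually cheapest to scan unless better indexes exist elsewhere:

    # Sketch: choose the smallest materialized cuboid that can answer a query.
    materialized = {                                   # hypothetical cuboids and their cell counts
        ("city", "item", "year"): 1_000_000,           # base cuboid
        ("city", "year"): 50_000,
        ("item", "year"): 40_000,
    }

    def pick_cuboid(query_dims, cuboids):
        candidates = [dims for dims in cuboids if set(query_dims) <= set(dims)]
        return min(candidates, key=lambda dims: cuboids[dims])   # least estimated cost

    print(pick_cuboid({"year"}, materialized))         # ('item', 'year') -- smallest usable cuboid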
Data Mining:
Data mining is a technology that blends traditional data analysis methods with sophisticated algorithms for processing large volumes of data.
• It has also opened up exciting opportunities for exploring and analyzing new types of data and for analyzing existing types of data in new ways.
• Data mining is the process of automatically discovering useful information in large data repositories.
Applications of data mining
• Find all credit applicants who are poor credit risks. (classification)
• Identify customers with similar buying habits. (Clustering)
• Find all items which are frequently purchased with milk. (association rules)
• Discover groups of similar documents on the Web
• Certain names are more popular in certain locations
• The input data can be stored in a variety of formats (flat files, spreadsheets, or relational
tables) and may reside in a centralized data repository or be distributed across multiple
sites.
• The purpose of preprocessing is to transform the raw input data into an appropriate format
for subsequent analysis.
• The steps involved in data preprocessing include
▪ fusing data from multiple sources,
▪ cleaning data to remove noise and duplicate observations, and
▪ selecting records and features that are relevant to the data mining task.
• Preprocessing is often the most time-consuming step in the overall knowledge discovery process.
• Postprocessing is the step that ensures that only valid and useful results are incorporated into the decision support system.
Motivating Challenges:
• Scalability: Because of advances in data generation and collection, data sets with sizes of gigabytes, terabytes, or even petabytes are becoming common. If data mining algorithms are to handle these massive data sets, then they must be scalable.
• High Dimensionality: It is now common to encounter data sets with hundreds or thousands of attributes. Data sets with temporal or spatial components also tend to have high dimensionality. For example, consider a data set that contains measurements of temperature at various locations taken repeatedly over time. The computational complexity of many algorithms increases rapidly as the dimensionality (the number of features) increases.
• Heterogeneous and Complex Data: Traditional data analysis methods often deal with data sets containing attributes of the same type, either continuous or categorical. As the role of data mining in business, science, medicine, and other fields has grown, so has the need for techniques that can handle heterogeneous attributes.
• Data Ownership and Distribution: Sometimes, the data needed for an analysis is not stored in one location or owned by one organization. Instead, the data is geographically distributed among resources belonging to multiple entities. This requires the development of distributed data mining techniques. Key challenges faced by distributed data mining algorithms include (1) how to reduce the amount of communication needed to perform the distributed computation, (2) how to effectively consolidate the data mining results obtained from multiple sources, and (3) how to address data security issues.
• Non-traditional Analysis: The traditional statistical approach is based on a hypothesize-and-test paradigm. Current data analysis tasks, however, often require the generation and evaluation of thousands of hypotheses, and consequently, the development of some data mining techniques has been motivated by the desire to automate this non-traditional process of hypothesis generation and evaluation.
Predictive modeling refers to the task of building a model for the target variable as a function
of the explanatory variables.
There are two types of predictive modeling tasks:
▪ classification, which is used for discrete target variables, and
▪ regression, which is used for continuous target variables.
The goal of both tasks is to learn a model that minimizes the error between the predicted and true
values of the target variable.
For example, predicting whether a Web user will make a purchase at an online bookstore is a
classification task because the target variable is binary-valued.
On the other hand, forecasting the future price of a stock is a regression task because price is a
continuous-valued attribute.
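A hedged scikit-learn sketch of the two task types on synthetic data (the library calls, variable names, and data here are illustrative, not from the text): a classifier for a binary target and a regressor for a continuous target.

    # Sketch: classification (discrete target) vs. regression (continuous target) with scikit-learn.
    import numpy as np
    from sklearn.linear_model import LogisticRegression, LinearRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))                  # synthetic explanatory variables

    buy = (X[:, 0] + X[:, 1] > 0).astype(int)      # synthetic binary target: will the user purchase?
    price = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)  # synthetic price

    clf = LogisticRegression(max_iter=1000).fit(X, buy)   # classification model
    reg = LinearRegression().fit(X, price)                # regression model

    print(clf.predict(X[:5]))                      # predicted class labels (0/1)
    print(reg.predict(X[:5]))                      # predicted continuous values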
Cluster analysis example (document clustering): Consider a collection of news articles. The articles in the first cluster correspond to news about the economy, while the second cluster contains the articles that correspond to news about health care. A good clustering algorithm should be able to identify these two clusters based on the similarity between words that appear in the articles.
What is Data
• Collection of data objects and their attributes
• An attribute is a property or characteristic of an object
Types of Data:
• A data set can often be viewed as a collection of data objects.
• Other names for a data object are record, point, vector, pattern, event, case, sample,
observation, or entity.
• Data objects are described by a number of attributes that capture the basic characteristics of an object, such as the mass of a physical object or the time at which an event occurred. Other names for an attribute are variable, characteristic, field, feature, or dimension.
Qualitative Data: Arise when the observations fall into separate distinct categories.
Example: colors of eyes: brown, black, hazel, green.
Quantitative Data (Numeric Data): Arise when observations are counts or measurements.
The data are said to be “discrete” if the measurements are integers.
Example: number of people in the house.
The data are said to be “continuous” if the measurements can take any value, usually within some range.
Example: weight
Type of an Attribute:
The properties of an attribute need not be the same as the properties of the values used to measure
it. In other words, the values used to represent an attribute may have properties that are not
properties of the attribute itself, and vice versa.
The following properties (operations) of numbers are typically used to describe attributes:
1. Distinctness: = and ≠
2. Order: <, ≤, >, and ≥
3. Addition: + and −
4. Multiplication: × and /
Given these properties, we can define four types of attributes: nominal, ordinal, interval, and ratio.
Nominal Attribute
Nominal means “relating to names.” The values of a nominal attribute are symbols or names of
things. Each value represents some kind of category, code, or state, and so nominal attributes are
also referred to as categorical. The values do not have any meaningful order. In computer science,
the values are also known as enumerations.
Ordinal attribute
An ordinal attribute is an attribute with possible values that have a meaningful order or ranking
among them, but the magnitude between successive values is not known.
Interval-Scaled Attributes
Interval-scaled attributes are measured on a scale of equal-size units. The values of interval-
scaled attributes have order and can be positive, 0, or negative. Thus, in addition to providing a
ranking of values, such attributes allow us to compare and quantify the difference between values.
Ratio-Scaled Attributes
A ratio-scaled attribute is a numeric attribute with an inherent zero-point. That is, if a
measurement is ratio-scaled, we can speak of a value as being a multiple (or ratio) of another value.
In addition, the values are ordered, and we can also compute the difference between values, as
well as the mean, median, and mode.
• Practically, real values can only be measured and represented using a finite number of
digits.
• Continuous attributes are typically represented as floating point variables.
Graph-Based Data:
A graph can sometimes be a convenient and powerful representation for data. We consider two
specific cases: (1) the graph captures relationships among data objects and (2) the data objects
themselves are represented as graphs
Data with Relationships among Objects The relationships among objects frequently convey
important information. In such cases, the data is often represented as a graph. In particular, the
data objects are mapped to nodes of the graph, while the relationships among objects are captured
by the links between objects and link properties, such as direction and weight. Consider Web pages
on the World Wide Web, which contain both text and links to other pages.
Data with Objects That Are Graphs If objects have structure, that is, the objects contain
subobjects that have relationships, then such objects are frequently represented as graphs. For
example, the structure of chemical compounds can be represented by a graph, where the nodes
are atoms and the links between nodes are chemical bonds.
Ordered Data
For some types of data, the attributes have relationships that involve order in time or space.
Sequence Data:
Sequence data consists of a data set that is a sequence of individual entities, such as a sequence of
words or letters. It is quite similar to sequential data, except that there are no time stamps; instead,
there are positions in an ordered sequence.
Example: the genetic information of plants and animals can be represented in the form of
sequences of nucleotides that are known as genes.
Spatial Data:
Some objects have spatial attributes, such as positions or areas, as well as other types of attributes.
Example: Weather data (precipitation, temperature, pressure) that is collected for a variety of
geographical locations.
Data Quality:
• Data mining applications are often applied to data that was collected for another purpose, or for future, but unspecified, applications.
• For that reason, data mining cannot usually take advantage of the significant benefits of "addressing quality issues at the source."
• Data mining focuses on
(1) the detection and correction of data quality problems (called data cleaning.)
(2) the use of algorithms that can tolerate poor data quality.
Data errors may be the result of a more deterministic phenomenon, such as a streak in the same
place on a set of photographs. Such deterministic distortions of the data are often referred to as
artifacts.
Precision is often measured by the standard deviation of a set of values, while bias is measured by taking the difference between the mean of the set of values and the known value of the quantity being measured. It is common to use the more general term, accuracy, to refer to the degree of measurement error in data.
Outliers:
Outliers are either (1) data objects that, in some sense, have characteristics that are different from
most of the other data objects in the data set, or (2) values of an attribute that are unusual with
respect to the typical values for that attribute. Alternatively, we can speak of anomalous objects or
values. It is important to distinguish between the notions of noise and outliers. Outliers can be
legitimate data objects or values. Thus, unlike noise, outliers may sometimes be of interest. In
fraud and network intrusion detection, for example, the goal is to find unusual objects or events
from among a large number of normal ones.
Missing Values:
It is not unusual for an object to be missing one or more attribute values. In some cases, the
information was not collected; e.g., some people decline to give their age or weight. In other cases,
some attributes are not applicable to all objects; e.g., often, forms have conditional parts that are
filled out only when a person answers a previous question in a certain way, but for simplicity, all fields are stored.
Handling missing values:
• Eliminate Data Objects
• Estimate Missing Values
• Ignore the Missing Value During Analysis
• Replace with all possible values (weighted by their probabilities)
Inconsistent Values:
Data can contain inconsistent values. Consider an address field, where both a zip code and city are listed, but the specified zip code area is not contained in that city. It may be that the individual entering this information transposed two digits, or perhaps a digit was misread when the information was scanned from a handwritten form. Some types of inconsistencies are easy to
detect. For instance, a person's height should not be negative. In other cases, it can be necessary
to consult an external source of information. For example, when an insurance company processes
claims for reimbursement, it checks the names and addresses on the reimbursement forms against
a database of its customers. Once an inconsistency has been detected, it is sometimes possible to
correct the data. A product code may have "check" digits, or it may be possible to double-check a
product code against a list of known product codes, and then correct the code if it is incorrect, but
close to a known code. The correction of an inconsistency requires additional or redundant
information.
Duplicate Data:
A data set may include data objects that are duplicates, or almost duplicates, of one another. To
detect and eliminate such duplicates, two main issues must be addressed. First, if there are two
objects that actually represent a single object, then the values of corresponding attributes may
differ, and these inconsistent values must be resolved. Second, care needs to be taken to avoid
accidentally combining data objects that are similar, but not duplicates, such as two distinct people
with identical names. The term deduplication is often used to refer to the process of dealing with
these issues. In some cases, two or more objects are identical with respect to the attributes
measured by the database, but they still represent different objects. Here, the duplicates are
legitimate, but may still cause problems for some algorithms if the possibility of identical objects
is not specifically accounted for in their design.
Data Preprocessing:
• Data preprocessing is a data mining technique that involves transforming raw data into an understandable format. Real-world data is often incomplete, inconsistent, lacking in certain behaviors or trends, and likely to contain many errors.
• Data preprocessing is a method of resolving such issues.
• Some of the most important approaches for data preprocessing are:
• Aggregation
• Sampling
• Dimensionality reduction
• Feature subset selection
• Feature creation
• Discretization and binarization
• Variable transformation
These approaches fall into two categories: selecting data objects and attributes for the analysis or
creating/changing the attributes.
Aggregation:
• Aggregation is the combining of two or more objects into a single object. Consider a data set consisting of transactions (data objects) recording the daily sales of products in various store locations (Minneapolis, Chicago, Paris, ...) for different days over the course of a year.
• Example: Merging daily sales figures to obtain monthly sales figures (a small sketch follows after this list).
• There are several motivations for aggregation:
o First, the smaller data sets resulting from data reduction require less memory and
processing time, and hence, aggregation may permit the use of more expensive data
mining algorithms.
o Second, aggregation can act as a change of scope or scale by providing a high-level
view of the data instead of a low-level view.
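A small pandas sketch of the daily-to-monthly aggregation described above, using made-up transactions (the column names and values are illustrative only):

    # Sketch: aggregate daily sales transactions into monthly totals per store location.
    import pandas as pd

    daily = pd.DataFrame({
        "date": pd.to_datetime(["2024-01-03", "2024-01-17", "2024-02-05", "2024-02-20"]),
        "location": ["Minneapolis", "Minneapolis", "Chicago", "Chicago"],
        "sales": [120.0, 90.0, 200.0, 150.0],
    })

    monthly = (daily
               .groupby([daily["date"].dt.to_period("M"), "location"])["sales"]
               .sum()
               .reset_index(name="monthly_sales"))
    print(monthly)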
Sampling:
• Sampling is a commonly used approach for selecting a subset of the data objects to be
analyzed. In statistics, it has long been used for both the preliminary investigation of the
data and the final data analysis.
• The key principle for effective sampling is the following: Using a sample will work almost
as well as using the entire data set if the sample is representative.
• A sample is representative if it has approximately the same property (of interest) as the
original set of data.
Sampling Approaches:
There are two variations on random sampling (and other sampling techniques as well):
(1) sampling without replacement-as each item is selected, it is removed from the set of
all objects that together constitute the population, and
(2) sampling with replacement-objects are not removed from the population as they are
selected for the sample.
In sampling with replacement, the same object can be picked more than once. The samples
produced by the two methods are not much different when samples are relatively small compared
to the data set size, but sampling with replacement is simpler to analyze since the probability of
selecting any object remains constant during the sampling process.
Stratified sampling: which starts with prespecified groups of objects, is such an approach. In the
simplest version, equal numbers of objects are drawn from each group even though the groups are
of different sizes. In another variation, the number of objects drawn from each group is
proportional to the size of that group.
Progressive Sampling: The proper sample size can be difficult to determine, so adaptive or
progressive sampling schemes are sometimes used. These approaches start with a small sample,
and then increase the sample size until a sample of sufficient size has been obtained.
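A brief numpy sketch contrasting sampling without and with replacement, plus proportional stratified sampling; the population and group labels are hypothetical:

    # Sketch: sampling without/with replacement and proportional stratified sampling.
    import numpy as np

    rng = np.random.default_rng(42)
    population = np.arange(1000)                        # object ids

    without_repl = rng.choice(population, size=50, replace=False)  # each object picked at most once
    with_repl = rng.choice(population, size=50, replace=True)      # the same object can recur

    # Proportional stratified sampling: draw from each prespecified group in proportion to its size.
    groups = {"A": np.arange(0, 700), "B": np.arange(700, 1000)}
    sample_size = 50
    stratified = {
        name: rng.choice(members,
                         size=round(sample_size * len(members) / len(population)),
                         replace=False)
        for name, members in groups.items()
    }
    print(len(without_repl), len(with_repl), {k: len(v) for k, v in stratified.items()})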
Dimensionality Reduction:
• Data sets can have a large number of features. There are a variety of benefits to
dimensionality reduction.
• A key benefit is that many data mining algorithms work better if the dimensionality (the number of attributes in the data) is lower.
• Another benefit is that a reduction of dimensionality can lead to a more understandable
model because the model may involve fewer attributes.
• Also, dimensionality reduction may allow the data to be more easily visualized. Even if
dimensionality reduction doesn't reduce the data to two or three dimensions, data is often
visualized by looking at pairs or triplets of attributes, and the number of such combinations
is greatly reduced.
• The amount of time and memory required by the data mining algorithm is reduced with a
reduction in dimensionality.
• The term dimensionality reduction is often reserved for those techniques that reduce the
dimensionality of a data set by creating new attributes that are a combination of the old
attributes. The reduction of dimensionality by selecting new attributes that are a subset of
the old is known as feature subset selection or feature selection.
Filter approaches: Features are selected before the data mining algorithm is run, using some
approach that is independent of the data mining task.
For example, we might select sets of attributes whose pairwise correlation is as low as possible.
Wrapper approaches: These methods use the target data mining algorithm as a black box to find
the best subset of attributes, in a way similar to that of the ideal algorithm described above, but
typically without enumerating all possible subsets.
Both the filter and wrapper approaches require an evaluation measure that attempts to determine the goodness of a subset of attributes with respect to a particular data mining task, such as classification or clustering.
• For the filter approach, such measures attempt to predict how well the actual data mining algorithm will perform on a given set of attributes.
• For the wrapper approach, where evaluation consists of actually running the target data mining application, the subset evaluation function is simply the criterion normally used to measure the result of the data mining.
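An illustrative scikit-learn sketch (the data set and parameter choices are invented here): SelectKBest acts as a filter that scores attributes independently of the final model, while RFE wraps a target classifier and searches for a good subset, consistent with the descriptions above.

    # Sketch: filter-style vs. wrapper-style feature subset selection with scikit-learn.
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, f_classif, RFE
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)

    # Filter approach: rank features with a statistic (ANOVA F-score), independent of the final model.
    filt = SelectKBest(score_func=f_classif, k=5).fit(X, y)
    print("filter keeps:", filt.get_support(indices=True))

    # Wrapper approach: repeatedly fit the target model, eliminating the weakest features each round.
    wrap = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)
    print("wrapper keeps:", wrap.get_support(indices=True))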
Because the number of subsets can be enormous and it is impractical to examine them all, some sort of stopping criterion is necessary.
This strategy is usually based on one or more conditions involving the following:
• The number of iterations, whether the value of the subset evaluation measure is optimal or
exceeds a certain threshold, whether a subset of a certain size has been obtained, whether
simultaneous size and evaluation criteria have been achieved, and whether any
improvement can be achieved by the options available to the search strategy.
• Finally, once a subset of features has been selected, the results of the target data mining
algorithm on the selected subset should be validated. A straightforward evaluation
approach is to run the algorithm with the full set of features and compare the full results to
results obtained using the subset of features.
Another validation approach is to use a number of different feature selection algorithms to obtain
subsets of features and then compare the results of running the data mining algorithm on each
subset.
Feature Weighting:
• Feature weighting is an alternative to keeping or eliminating features. More important
features are assigned a higher weight, while less important features are given a lower
weight. These weights are sometimes assigned based on domain knowledge about the
relative importance of features.
• Alternatively, they may be determined automatically. For example, some classification
schemes, such as support vector machines produce classification models in which each
feature is given a weight.
• Features with larger weights play a more important role in the model. The normalization
of objects that takes place when computing the cosine similarity can also be regarded as a
type of feature weighting.
Feature Creation:
• It is frequently possible to create, from the original attributes, a new set of attributes that
captures the important information in a data set much more effectively.
• Three related methodologies for creating new attributes are described next:
o Feature extraction,
o Mapping the data to a new space, and
o Feature construction.
Feature Extraction:
The creation of a new set of features from the original raw data is known as feature extraction.
Feature Construction:
Sometimes the features in the original data sets have the necessary information, but it is not in a
form suitable for the data mining algorithm. In this situation, one or more new features
constructed out of the original features can be more useful than the original features.
Example 2.11- (Density). Consider a data set consisting of information about historical artifacts,
which, along with other information, contains the volume and mass of each artifact. In this case, a
density feature constructed from the mass and volume features, i.e., density=mass/volume, would
most directly yield an accurate classification.
Although there have been some attempts to automatically perform feature construction by
exploring simple mathematical combinations of existing attributes, the most common approach is
to construct features using domain expertise.
Binarization
A simple technique to binarize a categorical attribute is the following:
• If there are m categorical values, then uniquely assign each original value to an integer in
the interval [0,m - 1].
• If the attribute is ordinal, then order must be maintained by the assignment. Next, convert each of these m integers to a binary number. Since n = ⌈log2(m)⌉ binary digits are required to represent these integers, represent these binary numbers using n binary attributes.
• To illustrate, a categorical variable with 5 values {awful, poor, OK, good, great} would
require three binary variables x1,x2,x3 . The conversion is shown in Table 2.5.
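A small sketch of the scheme just described for the {awful, poor, OK, good, great} example: each category is mapped to an integer, and each integer is then encoded with n = ⌈log2(m)⌉ binary attributes.

    # Sketch: binarize an ordinal attribute with m=5 values using n = ceil(log2 m) = 3 binary attributes.
    from math import ceil, log2

    values = ["awful", "poor", "OK", "good", "great"]   # order must be preserved
    to_int = {v: i for i, v in enumerate(values)}       # awful->0, ..., great->4
    n_bits = ceil(log2(len(values)))                    # 3

    def binarize(value):
        i = to_int[value]
        return [(i >> b) & 1 for b in reversed(range(n_bits))]   # [x1, x2, x3]

    for v in values:
        print(v, binarize(v))   # e.g. OK -> [0, 1, 0]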
Variable Transformation:
• A variable transformation refers to a transformation that is applied to all the values of a
variable. In other words, for each object, the transformation is applied to the value of the
variable for that object.
• For example, if only the magnitude of a variable is important, then the values of the variable
can be transformed by taking the absolute value.
• Two important types of variable transformations:
o simple functional transformations and
o normalization.
Simple Functions:
• For this type of variable transformation, a simple mathematical function is applied to each value individually. If x is a variable, then examples of such transformations include x^k, log x, e^x, √x, 1/x, sin x, and |x|. In statistics, variable transformations, especially sqrt, log, and 1/x, are often used to transform data that does not have a Gaussian (normal) distribution into data that does.
• Variable transformations should be applied with caution since they change the nature of the data. While this is sometimes exactly what is desired, there can be problems if the nature of the transformation is not fully appreciated. For instance, the transformation 1/x reduces the magnitude of values that are 1 or larger, but increases the magnitude of values between 0 and 1.
• To illustrate, the values {1, 2, 3} go to {1, 1/2, 1/3}, but the values {1, 1/2, 1/3} go to {1, 2, 3}. Thus, for all sets of values, the transformation 1/x reverses the order, as the short check below illustrates.
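A one-line numpy check of the order-reversal caution above (a sketch, nothing more):

    # Sketch: the transformation 1/x reverses the order of positive values.
    import numpy as np

    x = np.array([1.0, 2.0, 3.0])
    print(1.0 / x)            # [1.0, 0.5, 0.333...] -- increasing x becomes decreasing 1/x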
Normalization or Standardization:
• Another common type of variable transformation is the standardization or normalization of a variable. (In the data mining community the terms are often used interchangeably. In statistics, however, the term normalization can be confused with the transformations used for making a variable normal, i.e., Gaussian.)
• The goal of standardization or normalization is to make an entire set of values have a particular property. A common example is the z-score transformation x' = (x − mean) / (standard deviation), which produces values with zero mean and unit standard deviation; a small sketch follows.
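A minimal sketch of z-score standardization (the sample values are mine, not from the text):

    # Sketch: standardize a variable to zero mean and unit standard deviation (z-score).
    import numpy as np

    x = np.array([10.0, 12.0, 9.0, 15.0, 14.0])
    x_std = (x - x.mean()) / x.std()
    print(x_std.mean().round(6), x_std.std().round(6))   # approximately 0.0 and 1.0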
Definitions
• Informally, the similarity between two objects is a numerical measure of the degree to
which the two objects are alike. Consequently, similarities are higher for pairs of objects
that are more alike. Similarities are usually non-negative and are often between 0 (no
similarity) and 1 (complete similarity).
• The dissimilarity between two objects is a numerical measure of the degree to which the
two objects are different. Dissimilarities are lower for more similar pairs of objects.
Frequently, the term distance is used as a synonym for dissimilarity, although, as we shall
see, distance is often used to refer to a special class of dissimilarities. Dissimilarities
sometimes fall in the interval [0,1], but it is also common for them to range from 0 to ∞.
• The term proximity is used to refer to either similarity or dissimilarity. Since the proximity
between two objects is a function of the proximity between the corresponding attributes of
the two objects.
Transformations:
• Transformations are often applied to convert a similarity to a dissimilarity, or vice versa,
or to transform a proximity measure to fall within a particular range, such as [0,1].
• Frequently, proximity measures, especially similarities, are defined or transformed to have
values in the interval [0,1]. The motivation for this is to use a scale in which a proximity
value indicates the fraction of similarity (or dissimilarity) between two objects. Such a
transformation is often relatively straightforward.
• For example, if the similarities between objects range from 1 (not at all similar) to 10 (completely similar), we can make them fall within the range [0, 1] by using the transformation s' = (s − 1) / 9, where s and s' are the original and new similarity values, respectively.
• In the more general case, the transformation of similarities to the interval [0, 1] is given by s' = (s − min_s) / (max_s − min_s), where max_s and min_s are the maximum and minimum similarity values, respectively. Dissimilarities with a finite range can be mapped to [0, 1] in the same way, d' = (d − min_d) / (max_d − min_d).
Euclidean Distance:
The Euclidean distance, d, between two points, x and y, is
d(x, y) = sqrt( Σ from k = 1 to n of (x_k − y_k)^2 )
where n is the number of dimensions and x_k and y_k are, respectively, the kth attributes (components) of x and y. We illustrate this formula with Figure 2.15 and Tables 2.8 and 2.9, which show a set of points, the x and y coordinates of these points, and the distance matrix containing the pairwise distances of these points.
The Euclidean distance measure is generalized by the Minkowski distance metric shown below:
d(x, y) = ( Σ from k = 1 to n of |x_k − y_k|^r )^(1/r)
where r is a parameter.
The following are the three most common examples of Minkowski distances:
• r = 1. City block (Manhattan, taxicab, L1 norm) distance. A common example is the Hamming distance, which is the number of bits that differ between two binary vectors.
• r = 2. Euclidean distance (L2 norm).
• r = ∞. Supremum (L_max or L∞ norm) distance, the maximum difference between any attribute of the two objects.
The r parameter should not be confused with the number of dimensions (attributes) n. The Euclidean, Manhattan, and supremum distances are defined for all values of n: 1, 2, 3, ..., and specify different ways of combining the differences in each dimension (attribute) into an overall distance.
Tables 2.10 and 2.11, respectively, give the proximity matrices for the L1 and L∞ distances using data from Table 2.8. Notice that all these distance matrices are symmetric; i.e., the ijth entry is the same as the jith entry. In Table 2.9, for instance, the fourth row of the first column and the fourth column of the first row both contain the value 5.1.
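As a quick numerical check (the two points here are mine, not from Table 2.8), the sketch below computes the L1, L2, and L∞ distances listed above:

    # Sketch: Minkowski distances for r = 1 (Manhattan), r = 2 (Euclidean), r -> infinity (supremum).
    import numpy as np

    x = np.array([0.0, 2.0])
    y = np.array([3.0, 6.0])
    diff = np.abs(x - y)

    l1 = diff.sum()                    # 7.0  (Manhattan)
    l2 = np.sqrt((diff ** 2).sum())    # 5.0  (Euclidean)
    linf = diff.max()                  # 4.0  (supremum)
    print(l1, l2, linf)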
Simple Matching Coefficient (SMC):
SMC = (number of matching attribute values) / (number of attributes) = (f11 + f00) / (f01 + f10 + f11 + f00), where f11 is the number of attributes where both x and y are 1, f00 the number where both are 0, and f01 and f10 the numbers where they disagree.
This measure counts both presences and absences equally. Consequently, the SMC could be used to find students who had answered questions similarly on a test that consisted only of true/false questions.
The Jaccard coefficient, in contrast, ignores 0-0 matches: J = f11 / (f01 + f10 + f11).
Cosine Similarity:
• Documents are often represented as vectors, where each attribute represents the
frequency with which a particular term (word) occurs in the document.
• It is more complicated than this, of course, since certain common words are ignored and
various processing techniques are used to account for different forms of the same word,
differing document lengths, and different word frequencies.
• Even though documents have thousands or tens of thousands of attributes (terms), each
document is sparse since it has relatively few non-zero attributes. (The normalizations
used for documents do not create a non-zero entry where there was a zero entry; i.e., they
preserve sparsity.)
• Thus, as with transaction data, similarity should not depend on the number of shared 0
values since any two documents are likely to "not contain" many of the same words, and
therefore, if 0-0 matches are counted, most documents will be highly similar to most other
documents. Therefore, a similarity measure for documents needs to ignore 0-0 matches, like the Jaccard measure, but it must also be able to handle non-binary vectors.
• The cosine similarity, defined next, is one of the most common measures of document similarity. If x and y are two document vectors, then
cos(x, y) = (x · y) / (||x|| ||y||)
where · indicates the vector dot product and ||x|| is the length (norm) of vector x.
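A sketch computing SMC and Jaccard for binary vectors and cosine similarity for (document-style) count vectors; all the vectors are made up for illustration:

    # Sketch: simple matching coefficient, Jaccard coefficient, and cosine similarity.
    import numpy as np

    x = np.array([1, 0, 0, 0, 0, 0, 1, 0, 0, 1])
    y = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])

    f11 = int(np.sum((x == 1) & (y == 1)))
    f00 = int(np.sum((x == 0) & (y == 0)))
    f_mismatch = int(np.sum(x != y))

    smc = (f11 + f00) / len(x)                 # counts 0-0 matches
    jaccard = f11 / (f11 + f_mismatch)         # ignores 0-0 matches

    d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0], dtype=float)   # term-frequency vectors
    d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2], dtype=float)
    cosine = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))

    print(smc, jaccard, round(cosine, 4))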
Correlation:
• The correlation between two data objects that have binary or continuous variables is a
measure of the linear relationship between the attributes of the objects. (The calculation of
correlation between attributes, which is more common, can be defined similarly.)
• More precisely, Pearson's correlation coefficient between two data objects, x and y, is defined by the following equation:
corr(x, y) = covariance(x, y) / (standard_deviation(x) × standard_deviation(y)) = s_xy / (s_x s_y)
• A related issue is how to compute distance when there is correlation between some of the
attributes, perhaps in addition to differences in the ranges of values. A generalization of
Euclidean distance, the Mahalanobis distance, is useful when attributes are correlated, have
different ranges of values (different variances), and the distribution of the data is
approximately Gaussian (normal). Specifically, the Mahalanobis distance between two objects (vectors) x and y is defined as
mahalanobis(x, y) = (x − y) Σ^(-1) (x − y)^T
where Σ^(-1) is the inverse of the covariance matrix of the data.
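A sketch of Pearson's correlation and the Mahalanobis distance using numpy; the objects and the data set used for the covariance matrix are invented for illustration:

    # Sketch: Pearson's correlation between two objects and the Mahalanobis distance between them.
    import numpy as np

    x = np.array([-3.0, 6.0, 0.0, 3.0, -6.0])
    y = np.array([1.0, -2.0, 0.0, -1.0, 2.0])

    corr = np.corrcoef(x, y)[0, 1]          # covariance(x, y) / (std(x) * std(y)); here -1.0

    # Mahalanobis distance needs the covariance matrix of the data set the objects come from.
    data = np.random.default_rng(0).normal(size=(100, 5))      # hypothetical data set
    cov_inv = np.linalg.inv(np.cov(data, rowvar=False))
    diff = x - y
    mahalanobis = float(np.sqrt(diff @ cov_inv @ diff))

    print(round(corr, 4), round(mahalanobis, 4))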
Using Weights:
• In much of the previous discussion, all attributes were treated equally when computing proximity. This is not desirable when some attributes are more important to the definition of proximity than others. To address these situations, the proximity formulas can be modified by weighting the contribution of each attribute.
Problems:
1)Classify the following attributes as binary, discrete, or continuous. Also classify them as
qualitative (nominal or ordinal) or quantitative (interval or ratio). Some cases may have more than
one interpretation, so briefly indicate your reasoning if you think there may be some ambiguity.
Example: Age in years. Answer: Discrete, quantitative, ratio
(a) Time in terms of AM or PM. Answer: Binary, qualitative, ordinal
(b) Brightness as measured by a light meter. Answer: Continuous, quantitative, ratio
(c) Brightness as measured by people’s judgments. Answer: Discrete, qualitative, ordinal
(d) Angles as measured in degrees between 0◦ and 360◦ . Answer: Continuous, quantitative, ratio
(e) Bronze, Silver, and Gold medals as awarded at the Olympics. Answer: Discrete, qualitative,
ordinal
(f) Height above sea level. Answer: Continuous, quantitative, interval/ratio (depends on whether
sea level is regarded as an arbitrary origin)
(g) Number of patients in a hospital. Answer: Discrete, quantitative, ratio
(h) ISBN numbers for books. (Look up the format on the Web (ISBN numbers do have order
information, though) .) Answer: Discrete, qualitative, nominal
(i) Ability to pass light in terms of the following values: opaque, translucent, transparent. Answer:
Discrete, qualitative, ordinal
(j) Military rank. Answer: Discrete, qualitative, ordinal
(k) Distance from the center of campus. Answer: Continuous, quantitative, interval/ratio
(depends)
(l) Density of a substance in grams per cubic centimeter. Answer: Continuous, quantitative, ratio
(m) Coat check number. (When you attend an event, you can often give your coat to someone who,
in turn, gives you a number that you can use to claim your coat when you leave.) Answer:
Discrete, qualitative, nominal
2) Compute the Hamming distance and the Jaccard similarity between the following two binary
vectors
x = 0101010001
y = 0100011000
Solution: Hamming distance = number of different bits = 3
Jaccard Similarity = number of 1-1 matches / (number of bits – number of 0-0 matches) = 2 / 5 = 0.4
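A quick check of this answer in Python (the bit strings are the ones given in the problem):

    # Sketch: verify Hamming distance and Jaccard similarity for the two binary vectors.
    x = "0101010001"
    y = "0100011000"

    hamming = sum(a != b for a, b in zip(x, y))                          # 3
    f11 = sum(a == "1" and b == "1" for a, b in zip(x, y))               # 2
    f00 = sum(a == "0" and b == "0" for a, b in zip(x, y))               # 5
    jaccard = f11 / (len(x) - f00)                                       # 2 / 5 = 0.4
    print(hamming, jaccard)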
3) For the following vectors, x and y, calculate the indicated similarity or distance measures.
(a) x = (1, 1, 1, 1), y = (2, 2, 2, 2): cosine, correlation, Euclidean
(b) x = (0, 1, 0, 1), y = (1, 0, 1, 0): cosine, correlation, Euclidean, Jaccard
(c) x = (0, -1, 0, 1), y = (1, 0, -1, 0): cosine, correlation, Euclidean
(d) x = (1, 1, 0, 1, 0, 1), y = (1, 1, 1, 0, 0, 1): cosine, correlation, Jaccard
(e) x = (2, -7, 0, 2, 0, -3), y = (-1, 1, -1, 0, 0, -1): cosine, correlation
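Rather than listing answers, here is a sketch that computes the requested measures for any pair of vectors from the problem; note that the correlation in case (a) is undefined because both vectors have zero variance.

    # Sketch: compute cosine, correlation, Euclidean distance, and (for binary data) Jaccard similarity.
    import numpy as np

    def measures(x, y):
        x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
        out = {
            "cosine": x @ y / (np.linalg.norm(x) * np.linalg.norm(y)),
            "euclidean": np.linalg.norm(x - y),
        }
        if x.std() > 0 and y.std() > 0:                 # correlation is undefined for constant vectors
            out["correlation"] = np.corrcoef(x, y)[0, 1]
        if set(np.unique(x)) <= {0.0, 1.0} and set(np.unique(y)) <= {0.0, 1.0}:
            f11 = np.sum((x == 1) & (y == 1))
            f00 = np.sum((x == 0) & (y == 0))
            out["jaccard"] = f11 / (len(x) - f00)
        return out

    print(measures((0, 1, 0, 1), (1, 0, 1, 0)))        # part (b)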