Unit 2
Syllabus:
Data Mining: Introduction, Data Mining, Motivating challenges, Origins of Data Mining, Data
Mining Tasks, Types of Data, Data Quality.
End of Syllabus
Data mining is the process of automatically discovering useful information in large data repositories.
Data mining techniques are deployed to scour large databases in order to find novel and useful
patterns. They also provide capabilities to predict the outcome of a future observation. Data mining is
an integral part of knowledge discovery in databases (KDD), which is the overall process of
converting raw data into useful information, as shown in Figure 1.1. This process consists of a series
of transformation steps, from data pre-processing to post-processing of data mining results.
The input data can be stored in a variety of formats like flat files, spreadsheets, or relational tables.
The data may reside in a centralized data repository or be distributed across multiple sites. Pre-
processing transforms the raw input data into an appropriate format suitable for subsequent analysis.
The steps involved in data pre-processing include fusing data from multiple sources, cleaning data to
remove noise and duplicate observations, and selecting records and features that are relevant to the
data mining task at hand. Data pre-processing is the most laborious and time-consuming step in the
overall knowledge discovery process. A post-processing step ensures that only valid and useful results
are incorporated into the decision support system. An example of post-processing is visualization,
which allows analysts to explore the data and the data mining results from a variety of viewpoints.
Figure 1.1: The overall process of knowledge discovery in databases (KDD).
Motivating Challenges:
Traditional data analysis techniques often cannot cope with the challenges posed by new kinds of data sets.
The following are some of the specific challenges that motivated the development of data mining.
Scalability: Because of advances in data generation and collection, data sets with huge sizes are
becoming common. The data mining algorithms used to handle these massive data sets must be
scalable. Many data mining algorithms employ special search strategies to handle exponential search
problems. Scalability may also require the implementation of novel data structures to access
individual records in an efficient manner.
High Dimensionality: Data sets may contain hundreds of attributes. In bioinformatics, gene
expression data may involve a large number of features. Data sets with temporal or spatial components
also tend to have high dimensionality. Traditional data analysis techniques that were developed for
low-dimensional data often do not work well for high-dimensional data, and their computational
complexity increases rapidly as the dimensionality increases.
Heterogeneous and Complex Data: Traditional data analysis methods often deal with data sets
containing attributes of the same type, either continuous or categorical. Efficient techniques are
needed to handle heterogeneous attributes. More complex data objects are also becoming common.
Examples of such data include collections of Web pages containing semi-structured text and
hyperlinks, and DNA data with sequential and three-dimensional structure.
Data ownership and Distribution: Sometimes, the data needed for an analysis is not stored in one
location or owned by one organization. Instead, the data is geographically distributed among
resources belonging to multiple entities. This requires the development of distributed data mining
techniques. The key challenges in this task are: (1) how to reduce the amount of communication
needed to perform the distributed computation, (2) how to effectively consolidate the data mining
results obtained from multiple sources, and (3) how to address data security issues.
Data Mining Tasks: Data mining tasks are generally divided into two major categories:
Predictive Tasks: The objective of these tasks is to predict the value of a particular attribute based on
the values of other attributes. The attribute to be predicted is commonly known as the target or
dependent variable, while the attributes used for making the prediction are known as the explanatory
or independent variables.
Descriptive Tasks: Here, the objective is to derive patterns (correlations, trends, clusters, trajectories,
and anomalies) that summarize the underlying relationships in data. Descriptive data mining tasks
frequently require post-processing techniques to validate and explain the results.
Predictive Modelling refers to the task of building a model for the target variable as a function of the
explanatory variables. There are two types of predictive modelling tasks: classification, which is used
for discrete target variables, and regression, which is used for continuous target variables. For
example, predicting whether a Web user will make a purchase at an online bookstore is a
classification task because the target variable is binary-valued. On the other hand, forecasting the
future price of a stock is a regression task because price is a continuous-valued attribute. The goal of
both tasks is to learn a model that minimizes the error between the predicted and true values of the
target variable.
Examples of predictive modelling include identifying customers that will respond to a marketing
campaign, predicting disturbances in the Earth's ecosystem, and judging whether a patient has a
particular disease based on the results of medical tests.
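To make the distinction concrete, the following sketch (not from the source) trains a simple classifier for the bookstore-purchase example; the features, data values, and the choice of a decision tree are illustrative assumptions.

```python
# Illustrative classification sketch: predicting a binary purchase outcome
# from made-up explanatory variables (pages viewed, minutes on site).
from sklearn.tree import DecisionTreeClassifier

X = [[3, 5], [10, 22], [1, 1], [8, 15], [2, 3], [12, 30]]   # explanatory (independent) variables
y = [0, 1, 0, 1, 0, 1]                                      # target: 1 = purchase, 0 = no purchase

model = DecisionTreeClassifier()   # discrete target, so this is a classification task
model.fit(X, y)                    # learn a model that maps explanatory variables to the target
print(model.predict([[9, 20]]))    # predicted class for a new Web user
```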
Association Analysis is used to discover patterns that describe strongly associated features in the
data. The discovered patterns are typically represented in the form of implication rules or feature
subsets. Because of the exponential size of its search space, the goal of association analysis is to
extract the most interesting patterns in an efficient manner. Useful applications of association analysis
include finding groups of genes that have related functionality, identifying Web pages that are
accessed together, or understanding the relationships between different elements of Earth's climate
system.
Cluster Analysis seeks to find groups of closely related objects so that objects that belong to the same
cluster are more similar to each other than objects that belong to other clusters. Clustering has been
used to group sets of related customers, to group related documents, find areas of the ocean that have
a significant impact on the Earth's climate, and compress data.
Anomaly Detection is the task of identifying observations whose characteristics are significantly
different from the rest of the data. Such observations are known as anomalies or outliers. The goal of
an anomaly detection algorithm is to discover the real anomalies and avoid falsely labelling normal
objects as anomalous. In other words, a good anomaly detector must have a high detection rate and a
low false alarm rate. Applications of anomaly detection include the detection of fraud, network
intrusions, unusual patterns of disease, and ecosystem disturbances.
Types of Data
A data set can often be viewed as a collection of data objects. Other names for a data object are
record, point, vector, pattern, event, case, sample, observation, or entity. In turn, data objects are
described by a number of attributes that capture the basic characteristics of an object, such as the mass
of a physical object or the time at which an event occurred. Other names for an attribute are variable,
characteristic, field, feature, or dimension.
General Characteristics of Data Sets: There are three major characteristics that apply to many data
sets and have a significant impact on the data mining techniques that can be used: dimensionality,
sparsity, and resolution.
Dimensionality
The dimensionality of a data set is the number of attributes that the objects in the data set possess.
Data with low dimensionality tend to be easier to analyze, while the difficulties associated with
analyzing high-dimensional data are referred to as the curse of dimensionality. Because of this, an
important motivation in pre-processing the data is dimensionality reduction.
Sparsity
For some data sets, such as those with asymmetric features, most attributes of an object have values of
0; in many cases, less than 1% of the entries are non-zero. In practical terms, sparsity is an advantage
because usually only the non-zero values need to be stored and manipulated. This results in significant
savings with respect to computation time and storage. Furthermore, some data mining algorithms
work well only for sparse data.
Resolution
It is frequently possible to obtain data at different levels of resolution, and often the properties of the
data are different at different resolutions. For instance, the surface of the Earth seems very uneven at a
resolution of a few meters, but is relatively smooth at a resolution of tens of kilometers. The patterns
found in the data also depend on the level of resolution. If the resolution is too fine, a pattern may not
be visible or may be buried in noise; if the resolution is too coarse, the pattern may disappear.
There are many types of data sets, and an increasing variety of data sets is becoming available for analysis.
Generally, the types of data sets are divided into three groups: record data, graph-based data, and
ordered data.
Record Data
Much data mining work assumes that the data set is a collection of records (data objects), each of
which consists of a fixed set of data fields (attributes). In the most basic form of record data, there is
no explicit relationship among records or data fields, and every record (object) has the same set of
attributes. Record data is usually stored either in flat files or in relational databases. The database
serves as a convenient place to find records. Different types of record data are described below and
are illustrated in Figure 2.2.
Transaction data is a special type of record data, where each record (transaction) involves a set of
items. Consider a grocery store. The set of products purchased by a customer during one shopping trip
constitutes a transaction, while the individual products that were purchased are the items. This type of
data is called market basket data.
If all data objects in a collection of data have the same set of numeric attributes, then the data objects
can be thought of as points (vectors) in a multidimensional space, where each dimension represents a
distinct attribute describing the object. A set of such data objects can be interpreted as an m by n
matrix, where there are m rows, one for each object, and n columns, one for each attribute.
A sparse data matrix is a special case of a data matrix in which the attributes are of the same type and
are asymmetric; i.e., only non-zero values are important. Transaction data is an example of a sparse
data matrix that has only 0/1 entries. Another common example is document data. A document can be
represented as a term vector, where each term is an attribute of the vector and the value of each
component is the number of times the corresponding term occurs in the document. This representation
of a collection of documents is often called a document-term matrix.
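As an illustrative sketch (the documents and the use of scikit-learn's CountVectorizer are assumptions, not part of the source), a document-term matrix can be built as a sparse matrix of term counts:

```python
# Sketch: building a sparse document-term matrix from a few made-up documents.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "data mining finds useful patterns",
    "mining large data repositories",
    "useful patterns in large repositories",
]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)        # sparse m x n matrix: m documents, n terms

print(vectorizer.get_feature_names_out())   # the terms (attributes)
print(dtm.toarray())                        # dense view: count of each term in each document
```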
Graph-Based Data
A graph can sometimes be a convenient and powerful representation for data. Two cases are:
The relationships among objects frequently convey important information. In such cases, the data is
often represented as a graph. In particular, the data objects are mapped to nodes of the graph, while
the relationships among objects are captured by the links between objects and link properties, such as
direction and weight. Consider Web pages on the World Wide Web, which contain both text and links
to other pages. Figure 2.3(a) shows a set of linked Web pages.
If objects have structure, that is, the objects contain sub-objects that have relationships, then such
objects are frequently represented as graphs. For example, the structure of chemical compounds can
be represented by a graph, where the nodes are atoms and the links between nodes are chemical
bonds. Figure 2.3(b) shows a ball-and-stick diagram of the chemical compound Benzene.
Ordered Data
For some types of data, the attributes have relationships that involve order in time or space.
Sequential Data
Sequential or temporal data is an extension of record data, where each record has a time associated
with it. Consider a retail transaction data set that also stores the time at which the transaction took
place. This time information makes it possible to find useful patterns. Figure 2.4 (a) shows an
example of sequential transaction data.
Sequence Data
Sequence data consists of a data set that is a sequence of individual entities, like words or letters. In
this data, there are no time stamps; instead, there are positions in an ordered sequence. For example,
the genetic information of plants and animals can be represented in the form of sequences of
nucleotides that are known as genes. Figure 2.4(b) shows a section of the human genetic code
expressed using the four nucleotides from which all DNA is constructed: A, T, G, and C.
Time series data is a special type of sequential data in which each record is a time series, i.e., a series
of measurements taken over time. For example, a financial data set might contain objects that are time
series of the daily prices of various stocks. As another example, consider Figure 2.4(c), which shows a
time series of the average monthly temperature for Minneapolis during the years 1982 to 1994.
Spatial Data
Some objects have spatial attributes, such as positions or areas, as well as other types of attributes.
An example of spatial data is weather data (precipitation, temperature, pressure) that is collected for a
variety of geographical locations. An important aspect of spatial data is spatial autocorrelation; i.e.,
objects that are physically close tend to be similar in other ways as well.
Data Quality
Data used for any mining application should be of good quality. When huge amounts of data are
collected, data quality is a major issue. Because the data is often collected for purposes other than
mining, data quality problems usually cannot be prevented at the source. Hence, data mining focuses
on (1) the detection and correction of data quality problems and (2) the use of algorithms that can
tolerate poor data quality. The detection and correction of errors is often called data cleaning.
Measurement and data collection issues refer to problems due to human error, limitations of measuring
devices, or flaws in the data collection process. Values or even entire data objects may be missing. In
other cases, there may be spurious or duplicate objects; for example, there might be two different
records for a person who has recently lived at two different addresses. There may also be
inconsistencies, such as a person with a height of 2 meters but a weight of only 2 kilograms.
The term measurement error refers to any problem resulting from the measuring process. A common
problem is that the value recorded differs from the true value to some extent. For continuous
attributes, the numerical difference of the measured and true value is called the error.
The term data collection error refers to errors such as omitting data objects or attribute values, or
inappropriately including a data object. Both measurement errors and data collection errors can be
either systematic or random. There are certain types of data errors that are commonplace, and there
often exist well-developed techniques for detecting and/or correcting these errors.
Noise is the random component of a measurement error. It may involve the distortion of a value or the
addition of spurious objects. Figure 2.5 shows a time series before and after it has been disrupted by
random noise. If a bit more noise were added to the time series, its shape would be lost. Data errors
may be the result of a more deterministic phenomenon, such as a streak in the same place on a set of
photographs. Such deterministic distortions of the data are often referred to as artifacts.
Precision, bias, and accuracy are used to measure the quality of the measurement process and the
resulting data. Suppose that a set of repeated measurements of the same quantity is taken, and this set
of values is used to calculate a mean value that serves as an estimate of the true value.
Precision: The closeness of repeated measurements (of the same quantity) to one another.
Bias: A systematic variation of measurements from the quantity being measured.
Precision is often measured by the standard deviation of a set of values, while bias is measured by
taking the difference between the mean of the set of values and the known value of the quantity being
measured.
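A minimal numeric sketch of these two measures (the measurement values and the 1.000 "true value" are made up):

```python
# Sketch: estimating precision and bias from repeated measurements of a known quantity.
import numpy as np

true_value = 1.000                              # e.g., a standard laboratory weight of 1 kg
measurements = np.array([1.015, 0.990, 1.013, 1.001, 0.986])

precision = measurements.std()                  # closeness of the measurements to one another
bias = measurements.mean() - true_value         # systematic deviation from the true value

print(f"precision (std) = {precision:.4f}, bias = {bias:+.4f}")
```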
Accuracy: The closeness of measurements to the true value of the quantity being measured. Accuracy
depends on precision and bias, but since it is a general concept, there is no specific formula for
accuracy in terms of these two quantities. One important aspect of accuracy is the use of significant
digits.
Outliers
Outliers are either (1) data objects that, in some sense, have characteristics that are different from
most of the other data objects in the data set, or (2) values of an attribute that are unusual with respect
to the typical values for that attribute.
Missing Values
It is not unusual for an object to be missing one or more attribute values. In some cases, the
information was not collected. In other cases, some attributes are not applicable to all objects.
Regardless, missing values should be taken into account during the data analysis. There are several
strategies for dealing with missing data.
A simple and effective strategy is to eliminate objects with missing values. However, even a partially
specified data object contains some information, and if many objects have missing values, then a
reliable analysis can be difficult or impossible.
Sometimes missing data can be reliably estimated. For example, consider a data set that has many
similar data points. In this situation, the missing values can be estimated using the attribute values of
the closest point. If the attribute is continuous, then the average attribute value of the nearest
neighbours is used.
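A possible way to sketch this nearest-neighbour strategy is with scikit-learn's KNNImputer (a tooling choice assumed here, not prescribed by the source); missing entries are replaced by values taken from the closest point(s):

```python
# Sketch: estimating missing values from the attribute values of the closest points.
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0, 3.0],
    [1.1, np.nan, 2.9],   # missing value, should be estimated from the first row
    [8.0, 9.0, 7.5],
    [7.9, 9.2, np.nan],   # missing value, should be estimated from the third row
])

imputer = KNNImputer(n_neighbors=1)   # use the single nearest neighbour
print(imputer.fit_transform(X))       # missing entries filled in with neighbour values
```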
Many data mining approaches can be modified to ignore missing values. In a clustering problem, for
example, the similarity between pairs of data objects can be calculated using only the attributes that
are not missing. Many classification schemes can also be modified to work with missing values.
Inconsistent Values
Data can contain inconsistent values. Consider an address field, where both a zip code and city are
listed, but the specified zip code area is not contained in that city. Some types of inconsistencies are
easy to detect. For instance, a person's height should not be negative. Sometimes inconsistent values
can be corrected.
Duplicate Data
A data set may include data objects that are duplicates, or almost duplicates, of one another. Many
people receive duplicate mailings because they appear in a database multiple times under slightly
different names. These duplicate objects must be detected and removed. Care needs to be taken to
avoid accidentally combining data objects that are similar, but not duplicates.
Timeliness
Some data starts to age as soon as it has been collected. In particular, if the data provides a snapshot
of some ongoing phenomenon or process, such as the purchasing behaviour of customers or Web
browsing patterns, then this snapshot represents reality for only a limited time. If the data is out of
date, then so are the models and patterns that are based on it.
Relevance
The available data must contain the information necessary for the application. Consider the task of
building a model that predicts the accident rate for drivers. If information about the age and gender of
the driver is omitted, then it is likely that the model will have limited accuracy.
Ideally, data sets are accompanied by documentation that describes different aspects of the data; the
quality of this documentation can either aid or hinder the subsequent analysis. For example, if the
documentation identifies several attributes as being strongly related, these attributes are likely to
provide highly redundant information, and we may decide to keep just one.
Types of Attributes
Data Pre-Processing
Data should be pre-processed to make it more suitable for the data mining algorithms.
Aggregation
Aggregation combines two or more objects into a single object. Consider a data set consisting of
transactions recording the daily sales of products in various store locations (Minneapolis, Chicago,
Paris, ...) for different days over the course of a year. One way to aggregate transactions for this data
set is to replace all the transactions of a single store with a single store-wide transaction.
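A small pandas sketch of this kind of aggregation (store names, dates, and amounts are made up):

```python
# Sketch: replacing per-product transactions with a single store-wide daily total.
import pandas as pd

sales = pd.DataFrame({
    "store":   ["Minneapolis", "Minneapolis", "Chicago", "Chicago", "Paris"],
    "date":    ["2024-01-01"] * 5,
    "product": ["bread", "milk", "bread", "eggs", "milk"],
    "amount":  [12.0, 8.5, 10.0, 6.0, 9.5],
})

# Aggregate: one object per (store, date) instead of one object per product sold.
store_daily = sales.groupby(["store", "date"], as_index=False)["amount"].sum()
print(store_daily)
```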
Sampling
Sampling is a commonly used approach for selecting a subset of the data objects to be analyzed,
instead of processing the entire data set. Sampling is very useful in data mining. Data miners sample because
it is too expensive or time consuming to process all the data. In some cases, using a sampling
algorithm can reduce the data size to the point where a better, but more expensive algorithm can be
used. Using a sample will work almost as well as using the entire data set if the sample is
representative.
Sampling Approaches
The simplest type of sampling is simple random sampling. For this type of sampling, there is an
equal probability of selecting any particular item. There are two variations on random sampling: (1)
sampling without replacement (2) sampling with replacement. In sampling with replacement, the
same object can be picked more than once.
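The two variations can be sketched with NumPy's random generator (the object ids and sample size are illustrative):

```python
# Sketch: simple random sampling with and without replacement.
import numpy as np

rng = np.random.default_rng(seed=0)
objects = np.arange(100)                                        # 100 data objects with ids 0..99

no_replacement   = rng.choice(objects, size=10, replace=False)  # each object picked at most once
with_replacement = rng.choice(objects, size=10, replace=True)   # the same object may be picked twice

print("without replacement:", no_replacement)
print("with replacement:   ", with_replacement)
```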
Progressive Sampling
The proper sample size can be difficult to determine, so adaptive or progressive sampling schemes
are sometimes used. These approaches start with a small sample, and then increase the sample size
until a sample of sufficient size has been obtained. This technique does not require computing the
sample size initially, but it requires that there be a way to evaluate the sample to judge if it is large
enough.
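A rough sketch of progressive sampling follows; the adequacy test is a hypothetical placeholder (a real scheme might, for instance, stop when the accuracy of a model trained on successive samples levels off):

```python
# Sketch: progressive sampling, growing the sample until an evaluation says it is large enough.
import numpy as np

def sample_is_adequate(sample) -> bool:
    # Hypothetical criterion: stop when the standard error of the sample mean is small.
    return sample.std() / np.sqrt(len(sample)) < 0.02

def progressive_sample(data, start=100, growth=2):
    rng = np.random.default_rng(0)
    size = start
    while True:
        sample = rng.choice(data, size=min(size, len(data)), replace=False)
        if sample_is_adequate(sample) or size >= len(data):
            return sample
        size *= growth                 # sample still too small: increase the size and try again

data = np.random.default_rng(1).normal(size=50_000)
print("sample size used:", len(progressive_sample(data)))
```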
Dimensionality Reduction
Data sets can have a large number of features. Dimensionality refers to the number of attributes or
features. Consider a set of documents, where each document is represented by a vector whose
components are the frequencies with which each word occurs in the document. In such cases, there
are typically many attributes, one for each word in the vocabulary. Many data mining algorithms work
better if the dimensionality is lower. Dimensionality reduction eliminates irrelevant features, reduces
noise, and provides a more understandable model.
The curse of dimensionality refers to the phenomenon that many types of data analysis become
significantly harder as the dimensionality of the data increases. As the dimensionality increases, the
data becomes increasingly sparse in the space that it occupies.
Some of the most common approaches for dimensionality reduction, for continuous data, use
techniques from linear algebra to project the data from a high-dimensional space into a lower-
dimensional space. Principal Components Analysis (PCA) is a linear algebra technique for
continuous attributes that finds new attributes (principal components) that (1) are linear combinations
of the original attributes, (2) are orthogonal (perpendicular) to each other, and (3) capture the
maximum amount of variation in the data. For example, the first two principal components capture as
much of the variation in the data as is possible with two orthogonal attributes that are linear
combinations of the original attributes.
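A minimal PCA sketch with scikit-learn (the random 10-attribute data is only for illustration):

```python
# Sketch: projecting 10-dimensional data onto its first two principal components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))          # 200 objects described by 10 continuous attributes

pca = PCA(n_components=2)               # keep the two orthogonal directions of maximum variance
X_reduced = pca.fit_transform(X)        # each object is now described by 2 new attributes

print(X_reduced.shape)                  # (200, 2)
print(pca.explained_variance_ratio_)    # fraction of the total variation each component captures
```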
Another way to reduce the dimensionality is to use only a subset of the features. Redundant features
duplicate much or all of the information contained in one or more other attributes. Irrelevant features
contain no useful information for the data mining task. For instance, students' ID numbers are
irrelevant to the task of predicting students' grade point averages. There are three standard approaches
to feature selection: embedded, filter, and wrapper.
Embedded approaches
Feature selection occurs naturally as part of the data mining algorithm. During the operation of the
data mining algorithm, the algorithm itself decides which attributes to use and which to ignore.
Algorithms for building decision tree classifiers often operate in this manner.
Filter approaches
Features are selected before the data mining algorithm is run, using some approach that is
independent of the data mining task. For example, we might select sets of attributes whose pairwise
correlation is as low as possible.
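One possible sketch of such a filter (the 0.9 correlation threshold and the data are assumptions): compute the pairwise correlations and drop one attribute from every highly correlated pair.

```python
# Sketch: a correlation-based filter that runs before any data mining algorithm.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.normal(size=100), "c": rng.normal(size=100)})
df["b"] = df["a"] * 0.95 + rng.normal(scale=0.1, size=100)   # "b" is nearly redundant with "a"

corr = df.corr().abs()
to_drop = set()
for i, col_i in enumerate(corr.columns):
    for col_j in corr.columns[i + 1:]:
        if corr.loc[col_i, col_j] > 0.9:    # highly correlated pair: keep only one of the two
            to_drop.add(col_j)

print("selected features:", [c for c in df.columns if c not in to_drop])
```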
Wrapper approaches
These methods use the target data mining algorithm as a black box to find the best subset of
attributes, similar in spirit to the ideal approach of trying all possible subsets of features, but typically
without enumerating all possible subsets. In contrast, the embedded approaches are algorithm-specific.
It is possible to encompass both the filter and wrapper approaches within a common architecture, in
which the feature selection process is viewed as consisting of four parts: a measure for evaluating a
subset, a search strategy that controls the generation of a new subset of features, a stopping criterion,
and a validation procedure.
Feature Weighting
Feature weighting is an alternative to keeping or eliminating features. More important features are
assigned a higher weight, while less important features are given a lower weight. These weights are
sometimes assigned based on domain knowledge about the relative importance of features.
Feature Creation
It is frequently possible to create, from the original attributes, a new set of attributes that captures the
important information in a data set much more effectively. Three related methodologies for creating
new attributes are: feature extraction, mapping the data to a new space, and feature construction.
Feature Extraction
The creation of a new set of features from the original raw data is known as feature extraction.
Consider a set of photographs, where each photograph is to be classified according to whether or not
it contains a human face. The raw data is a set of pixels and is not suitable for many types of
classification algorithms. If the data is processed to provide higher level features, such as the presence
or absence of certain types of edges and areas that are highly correlated with the presence of human
faces, then a much broader set of classification techniques can be applied to this problem.
Mapping the Data to a New Space
A totally different view of the data can reveal important and interesting features. Consider time series
data, which often contains periodic patterns. If there is only a single periodic pattern and not much
noise, then the pattern is easily detected. If there are a number of periodic patterns and a significant
amount of noise is present, then these patterns are hard to detect. Such patterns can be detected by
applying a Fourier transform to the time series in order to change to a representation in which
frequency information is explicit. Besides the Fourier transform, the wavelet transform is also very
useful for time series and other types of data.
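A short sketch of this mapping (the 7 Hz signal and noise level are synthetic): after a Fourier transform, the periodic pattern shows up as a clear peak in the frequency domain.

```python
# Sketch: mapping a noisy periodic time series to the frequency domain.
import numpy as np

t = np.linspace(0, 1, 500, endpoint=False)                      # one second sampled 500 times
signal = np.sin(2 * np.pi * 7 * t)                              # periodic pattern at 7 Hz
signal += 0.5 * np.random.default_rng(0).normal(size=t.size)    # add random noise

spectrum = np.abs(np.fft.rfft(signal))                          # magnitude of each frequency component
freqs = np.fft.rfftfreq(t.size, d=t[1] - t[0])

peak = np.argmax(spectrum[1:]) + 1                              # skip the constant (0 Hz) component
print("dominant frequency:", freqs[peak], "Hz")                 # approximately 7 Hz
```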
Feature Construction
Sometimes the features in the original data sets have the necessary information, but it is not in a form
suitable for the data mining algorithm. In this situation, one or more new features constructed out of
the original features can be more useful than the original features.
Discretization and Binarization
Some data mining algorithms, especially certain classification algorithms, require that the data be in
the form of categorical attributes. Algorithms that find association patterns require that the data be in
the form of binary attributes. Thus, it is often necessary to transform a continuous attribute into a
categorical attribute (discretization), and both continuous and discrete attributes may need to be
transformed into one or more binary attributes (binarization).
Binarization
A simple technique to binarize a categorical attribute is the following: if there are m categorical
values, then uniquely assign each original value to an integer in the interval [0, m - 1]. If the attribute
is ordinal, then order must be maintained by the assignment. Next, convert each of these m integers to
a binary number.
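A sketch of this scheme for a hypothetical ordinal attribute with m = 5 values:

```python
# Sketch: binarizing an ordinal attribute by mapping values to integers and then to bits.
import math

values = ["awful", "poor", "OK", "good", "great"]       # ordinal attribute, m = 5
m = len(values)
n_bits = math.ceil(math.log2(m))                        # binary attributes needed: 3

to_int = {v: i for i, v in enumerate(values)}           # order-preserving map to [0, m - 1]

def binarize(value):
    i = to_int[value]
    return [int(bit) for bit in format(i, f"0{n_bits}b")]   # integer -> binary attributes

for v in values:
    print(f"{v:>5} -> {to_int[v]} -> {binarize(v)}")
```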
Discretization of Continuous Attributes
Discretization is typically applied to attributes that are used in classification or association analysis.
In general, the best discretization depends on the algorithm being used, as well as on the other
attributes being considered; in practice, however, the discretization of an attribute is usually
considered in isolation. Transformation of a
continuous attribute to a categorical attribute involves two subtasks: deciding how many categories to
have and determining how to map the values of the continuous attribute to these categories. In the first
step, after the values of the continuous attribute are sorted, they are then divided into n intervals by
specifying n-1 split points. In the second step, all the values in one interval are mapped to the same
categorical value.
Example:
Age: Discretize into three categories: Young, Middle-aged, and Old.
Unsupervised Discretization
A basic distinction between discretization methods for classification is whether class information is
used (supervised) or not (unsupervised).
The equal width approach divides the range of the attribute into a user-specified number of intervals
each having the same width.
Example: If we have a continuous attribute "Age" ranging from 20 to 60 and we want to discretize it
into 4 intervals using equal width approach, each interval will have a width of (60 - 20) / 4 = 10 years.
An equal frequency (equal depth) approach tries to put the same number of objects into each
interval.
Example: If we have a dataset of 100 individuals and we want to discretize their ages into 4 intervals
using equal frequency approach, each interval will contain 100 / 4 = 25 individuals. The intervals
would be created such that each interval contains 25 individuals with approximately equal frequency
distribution.
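Both unsupervised approaches can be sketched with pandas (the ages are randomly generated for illustration): pd.cut produces equal-width intervals, while pd.qcut produces equal-frequency intervals.

```python
# Sketch: equal-width vs. equal-frequency discretization of an Age attribute.
import numpy as np
import pandas as pd

ages = pd.Series(np.random.default_rng(0).integers(20, 61, size=100), name="Age")

equal_width = pd.cut(ages, bins=4)    # 4 intervals of equal width, roughly (60 - 20) / 4 = 10 years
equal_freq  = pd.qcut(ages, q=4)      # 4 intervals containing roughly 100 / 4 = 25 individuals each

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```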
Supervised Discretization
Using class labels information for discretization often produces better results. A simple approach is to
place the splits in a way that maximizes the purity of the intervals. Some statistically based
approaches start with small intervals and create larger intervals by merging adjacent intervals that are
similar. Entropy-based approaches are one of the most promising approaches to discretization.
Example:
Suppose we have a dataset containing information about loan applicants, including their income and
whether their loan application was approved or denied. We want to discretize the income variable into
intervals in order to predict the likelihood of loan approval.
Consider potential split points based on common income thresholds or significant values in the
dataset, such as 40,000, 60,000, and 80,000. For each potential split point, we calculate the entropy of
the resulting intervals based on the loan approval labels and compare the entropies to assess the purity
of the resulting intervals.
For example, when splitting at 40,000, the interval [0, 40,000) contains 2 approved applicants
(Applicants 1 and 3) and 1 denied applicant (Applicant 2), while the remaining applicants fall into the
interval [40,000, 120,000). The entropy of each interval is computed from these counts, and the
weighted average of the interval entropies gives the entropy of the split.
To choose the optimal split point that minimizes entropy, we compare the entropy values calculated
for each potential split point. The split point with the lowest entropy indicates the point that
maximizes the purity or homogeneity of the resulting intervals.
Based on these entropy values, the split point that minimizes entropy and maximizes the purity of the
resulting intervals is 60,000: the intervals it produces are purer with respect to loan approval than
those produced by the other candidate split points.
These intervals provide a discretized representation of income, dividing the applicants into two
distinct categories. This representation is useful for further analysis, such as predicting the likelihood
of loan approval based on income.
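The entropy computation can be sketched as follows; the incomes and approval labels below are hypothetical stand-ins (they are not the applicants from the example above), chosen so that the 60,000 split happens to come out purest.

```python
# Sketch: choosing a split point for income by minimizing the entropy of the resulting intervals.
import numpy as np

incomes  = np.array([25_000, 35_000, 45_000, 52_000, 61_000, 70_000, 85_000, 95_000])
approved = np.array([0,      0,      0,      1,      1,      1,      1,      1])   # 1 = approved

def entropy(labels):
    if len(labels) == 0:
        return 0.0
    p = np.bincount(labels) / len(labels)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def split_entropy(threshold):
    left, right = approved[incomes < threshold], approved[incomes >= threshold]
    # Entropy of the split = weighted average of the entropies of the two intervals.
    return (len(left) * entropy(left) + len(right) * entropy(right)) / len(approved)

for t in (40_000, 60_000, 80_000):
    print(f"split at {t:>6}: entropy = {split_entropy(t):.3f}")   # lowest entropy = purest split
```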
Variable Transformation
A variable transformation refers to a transformation that is applied to all the values of a variable. For
each object, the transformation is applied to the value of the variable for that object. For example, if
only the magnitude of a variable is important, then the values of the variable can be transformed by
taking the absolute value. Two important types of variable transformations are simple functional
transformations and normalization.
Simple Functions
For this type of variable transformation, a simple mathematical function is applied to each value
individually. If x is a variable, examples of such transformations include x^k, log x, e^x, and |x|.
Normalization or Standardization
Consider comparing people based on two variables: age and income. For any two people, the
difference in income will likely be much larger in absolute terms than the difference in age, so income
would dominate the comparison unless the values are transformed. The goal of standardization or
normalization is to make an entire set of values have a particular property; a common transformation
is to subtract the mean from each value and divide by the standard deviation, so that the transformed
variable has a mean of 0 and a standard deviation of 1.
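A minimal standardization sketch (the ages and incomes are made up): after subtracting the mean and dividing by the standard deviation, the two variables are on a comparable scale.

```python
# Sketch: standardizing age and income so that neither variable dominates a comparison.
import numpy as np

age    = np.array([25, 32, 47, 51, 60], dtype=float)
income = np.array([30_000, 45_000, 90_000, 70_000, 120_000], dtype=float)

def standardize(x):
    return (x - x.mean()) / x.std()     # transformed variable has mean 0, standard deviation 1

print(np.round(standardize(age), 2))
print(np.round(standardize(income), 2))
```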
Note: Prepare similarity and dissimilarity measures from class notes (Pearson correlation, cosine
similarity, Jaccard coefficient, Euclidean distance, Manhattan distance).
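For reference, the measures named in the note can be sketched with SciPy and NumPy on two small made-up vectors (an illustrative aid, not a substitute for the class notes):

```python
# Sketch: the similarity and dissimilarity measures listed above.
import numpy as np
from scipy.spatial import distance
from scipy.stats import pearsonr

x = np.array([1.0, 3.0, 0.0, 2.0])
y = np.array([2.0, 4.0, 1.0, 0.0])

print("Euclidean distance: ", distance.euclidean(x, y))
print("Manhattan distance: ", distance.cityblock(x, y))
print("Cosine similarity:  ", 1 - distance.cosine(x, y))      # SciPy returns the cosine *distance*
print("Pearson correlation:", pearsonr(x, y)[0])

# Jaccard coefficient for binary (asymmetric) attributes:
a = np.array([1, 1, 0, 1, 0])
b = np.array([1, 0, 0, 1, 1])
print("Jaccard coefficient:", 1 - distance.jaccard(a, b))     # SciPy returns the Jaccard distance
```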
-------XXXXXXX-------