Chapter 1 - What is Data Mining
Data mining is the process of automatically discovering useful information in large data
repositories. Data mining techniques are deployed to scour large data sets in order to find novel
and useful patterns that might otherwise remain unknown. They also provide the capability to
predict the outcome of a future observation, such as the amount a customer will spend at an
online or a brick-and-mortar store.
Not all information discovery tasks are considered data mining. Looking up individual records
in a database or finding web pages that contain a particular set of keywords, for example, are
tasks that can be accomplished through simple interactions with a
database management system or an information retrieval system. These systems rely on traditional
computer science techniques, which include sophisticated indexing structures and query
processing algorithms, for efficiently organizing and retrieving information from large data
repositories. Nonetheless, data mining techniques have been used to enhance the performance of
such systems by improving the quality of the search results based on their relevance to the input
queries.
Data Mining and Knowledge Discovery in Databases
Data mining is an integral part of knowledge discovery in databases (KDD), which is the overall
process of converting raw data into useful information, as shown in Figure 1.1. This process
consists of a series of steps, from data preprocessing to postprocessing of data mining results.
The input data can be stored in a variety of formats (flat files, spreadsheets, or relational tables)
and may reside in a centralized data repository or be distributed across multiple sites. The purpose
of preprocessing is to transform the raw input data into an appropriate format for subsequent
analysis. The steps involved in data preprocessing include fusing data from multiple sources,
cleaning data to remove noise and duplicate observations, and selecting records and features that
are relevant to the data mining task at hand. Because of the many ways data can be collected and
stored, data preprocessing is perhaps the most laborious and time-consuming step in the overall
knowledge discovery process.
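To make these preprocessing steps concrete, the following Python sketch (using the pandas library) fuses two hypothetical input files, removes duplicate and incomplete observations, and selects the records and features relevant to the task; the file and column names are illustrative only, not part of any particular application.

# Minimal preprocessing sketch with pandas (file and column names are hypothetical).
import pandas as pd

# Fuse data from two hypothetical sources on a shared customer_id key.
orders = pd.read_csv("orders.csv")
customers = pd.read_csv("customers.csv")
data = orders.merge(customers, on="customer_id", how="inner")

# Clean: drop duplicate observations and rows with missing values.
data = data.drop_duplicates().dropna()

# Select only the records and features relevant to the mining task.
data = data[data["amount"] > 0]
features = data[["age", "annual_income", "amount"]]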
“Closing the loop” is a phrase often used to refer to the process of integrating data mining results
into decision support systems. For example, in business applications, the insights offered by data
mining results can be integrated with campaign management tools so that effective marketing
promotions can be conducted and tested. Such integration requires a postprocessing step to ensure
that only valid and useful results are incorporated into the decision support system. An example
of postprocessing is visualization, which allows analysts to explore the data and the data mining
results from a variety of viewpoints. Hypothesis testing methods can also be applied during
postprocessing to eliminate spurious data mining results.
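As a simple illustration of such postprocessing, the sketch below applies a chi-square test of independence (via scipy) to a 2x2 contingency table summarizing a discovered pattern; the counts and the significance level are chosen purely for illustration, not taken from any real analysis.

# Postprocessing sketch: keep a discovered pattern only if a chi-square test
# on its 2x2 contingency table rejects independence (threshold assumed).
from scipy.stats import chi2_contingency

# Hypothetical counts: rows = antecedent present/absent,
# columns = consequent present/absent.
table = [[400, 100],
         [150, 350]]

chi2, p_value, dof, expected = chi2_contingency(table)
if p_value < 0.01:          # significance level chosen for illustration
    print("pattern retained for the decision support system")
else:
    print("pattern discarded as potentially spurious")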
Motivating Challenges
As mentioned earlier, traditional data analysis techniques have often encountered practical
difficulties in meeting the challenges posed by big data applications. The following are some of
the specific challenges that motivated the development of data mining.
Scalability Because of advances in data generation and collection, data sets with sizes of terabytes,
petabytes, or even exabytes are becoming common. If data mining algorithms are to handle these
massive data sets, they must be scalable. Many data mining algorithms employ special search
strategies to handle exponential search problems. Scalability may also require the implementation
of novel data structures to access individual records in an efficient manner. For instance,
out-of-core algorithms may be necessary when processing data sets that cannot fit into main memory.
Scalability can also be improved by using sampling or developing parallel and distributed
algorithms.
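The following sketch illustrates the out-of-core idea with pandas: a file assumed to be too large for main memory is streamed in fixed-size chunks and aggregated incrementally; the file name, column name, and chunk size are hypothetical.

# Out-of-core sketch: aggregate a file too large for main memory by
# streaming it in chunks (file and column names are hypothetical).
import pandas as pd

total, count = 0.0, 0
for chunk in pd.read_csv("transactions.csv", chunksize=1_000_000):
    total += chunk["amount"].sum()
    count += len(chunk)

print("mean transaction amount:", total / count)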
High Dimensionality It is now common to encounter data sets with hundreds or thousands of
attributes instead of the handful that were common a few decades ago. In bioinformatics, progress in
microarray technology has produced gene expression data involving thousands of features. Data
sets with temporal or spatial components also tend to have high dimensionality. For example,
consider a data set that contains measurements of temperature at various locations. If the
temperature measurements are taken repeatedly for an extended period, the number of dimensions
(features) increases in proportion to the number of measurements taken. Traditional data analysis
techniques that were developed for low-dimensional data often do not work well for such high-
dimensional data due to issues such as the curse of dimensionality (to be discussed in Chapter 2).
Also, for some data analysis algorithms, the computational complexity increases rapidly as the
dimensionality (the number of features) increases.
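The short NumPy sketch below illustrates one commonly cited symptom of the curse of dimensionality on synthetic data: as the number of features grows, the distances from a query point to its nearest and farthest neighbors become nearly indistinguishable, which undermines techniques that rely on distance contrasts.

# Sketch of one symptom of the curse of dimensionality: as the number of
# features grows, nearest and farthest distances from a query point converge.
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    points = rng.random((1000, d))
    query = rng.random(d)
    dist = np.linalg.norm(points - query, axis=1)
    print(d, round(dist.min() / dist.max(), 3))   # ratio approaches 1 as d grows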
Heterogeneous and Complex Data Traditional data analysis methods often deal with data sets
containing attributes of the same type, either continuous or categorical. As the role of data mining
in business, science, medicine, and other fields has grown, so has the need for techniques that can
handle heterogeneous attributes. Recent years have also seen the emergence of more complex data
objects. Examples of such non-traditional types of data include web and social media data
containing text, hyperlinks, images, audio, and videos; DNA data with sequential and three-
dimensional structure; and climate data that consists of measurements (temperature, pressure, etc.)
at various times and locations on the Earth’s surface. Techniques developed for
mining such complex objects should take into consideration relationships in the data, such as
temporal and spatial autocorrelation, graph connectivity, and parent-child relationships between
the elements in semi-structured text and XML documents.
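As a small illustration of handling heterogeneous attributes, the pandas sketch below one-hot encodes a categorical attribute and standardizes a continuous one so that both can be consumed by a standard algorithm; the attribute names and values are made up for the example.

# Sketch of preparing heterogeneous attributes: one-hot encode a categorical
# column and standardize a continuous one (names and values are hypothetical).
import pandas as pd

records = pd.DataFrame({
    "region": ["north", "south", "north"],     # categorical attribute
    "temperature": [21.5, 30.2, 19.8],          # continuous attribute
})
encoded = pd.get_dummies(records, columns=["region"])
encoded["temperature"] = (
    encoded["temperature"] - encoded["temperature"].mean()
) / encoded["temperature"].std()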
Data Ownership and Distribution Sometimes, the data needed for an analysis is not stored in
one location or owned by one organization. Instead, the data is geographically distributed among
resources belonging to multiple entities. This requires the development of distributed data mining
techniques. The key challenges faced by distributed data mining algorithms include the following:
(1) how to reduce the amount of communication needed to perform the distributed computation,
(2) how to effectively consolidate the data mining results obtained from multiple sources, and (3)
how to address data security and privacy issues.
Non-traditional Analysis The traditional statistical approach is based on a hypothesize-and-test
paradigm. In other words, a hypothesis is proposed, an experiment is designed to gather the data,
and then the data is analyzed with respect to the hypothesis. Unfortunately, this process is
extremely labor-intensive. Current data analysis tasks often require the generation and evaluation
of thousands of hypotheses, and consequently, the development of some data mining techniques
has been motivated by the desire to automate the process of hypothesis generation and evaluation.
Furthermore, the data sets analyzed in data mining are typically not the result of a carefully
designed experiment and often represent opportunistic samples of the data, rather than random
samples.
Data Mining Tasks
Predictive modeling refers to the task of building a model for a target variable as a function of
the explanatory variables. Classification is used when the target variable is discrete, for example,
predicting whether a customer will respond to a marketing offer. Forecasting the future price of a
stock, on the other hand, is a regression task because price is a
continuous-valued attribute. The goal of both tasks is to learn a model that minimizes the error
between the predicted and true values of the target variable. Predictive modeling can be used to
identify customers who will respond to a marketing campaign, predict disturbances in the Earth’s
ecosystem, or judge whether a patient has a particular disease based on the results of medical tests.
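The following scikit-learn sketch contrasts the two predictive modeling tasks on synthetic data: a classifier is fit to a discrete target and a regression model to a continuous target. The data sets are generated purely for illustration and stand in for applications such as those mentioned above.

# Sketch contrasting classification and regression on synthetic data.
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LogisticRegression, LinearRegression

# Classification: discrete target (e.g., responds to a campaign or not).
Xc, yc = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(Xc, yc)

# Regression: continuous target (e.g., a future stock price).
Xr, yr = make_regression(n_samples=200, n_features=5, random_state=0)
reg = LinearRegression().fit(Xr, yr)

print(clf.predict(Xc[:3]), reg.predict(Xr[:3]))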
Example 1.1 (Predicting the Type of a Flower). Consider the task of predicting the species of a
flower based on its characteristics. In particular, consider classifying an Iris flower
as one of the following three Iris species: Setosa, Versicolour, or Virginica. To perform this task,
we need a data set containing the characteristics of various flowers of these three species. A data
set with this type of information is the well-known Iris data set from the UCI Machine Learning
Repository at http://www.ics.uci.edu/~mlearn. In addition to the species of a flower, this data set
contains four other attributes: sepal width, sepal length, petal length, and petal width. Figure 1.4
shows a plot of petal width versus petal length for the 150 flowers in the Iris data set. Petal width
is broken into the categories low, medium, and high, which correspond to the intervals [0, 0.75),
[0.75, 1.75), [1.75, ∞), respectively. Also, petal length is broken into categories low, medium, and
high, which correspond to the intervals [0, 2.5), [2.5, 5), [5, ∞), respectively. Based on these
categories of petal width and length, the following rules can be derived:
• Petal width low and petal length low implies Setosa.
• Petal width medium and petal length medium implies Versicolour.
• Petal width high and petal length high implies Virginica.
While these rules do not classify all the flowers, they do a good (but not perfect) job of classifying
most of the flowers. Note that flowers from the Setosa species are well separated from the
Versicolour and Virginica species with respect to petal width and length, but the latter two species
overlap somewhat with respect to these attributes.
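The sketch below applies these three rules to the Iris data, here loaded through scikit-learn rather than downloaded from the UCI repository, and reports how many flowers the rules cover and how accurately the covered flowers are classified.

# Apply the three rules above to the Iris data using the stated intervals.
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
petal_length = iris.data[:, 2]   # cm
petal_width = iris.data[:, 3]    # cm

length_bin = np.digitize(petal_length, [2.5, 5.0])   # 0=low, 1=medium, 2=high
width_bin = np.digitize(petal_width, [0.75, 1.75])   # 0=low, 1=medium, 2=high

# A rule fires only when the two bins agree; otherwise leave unclassified (-1).
# Bin values 0, 1, 2 correspond to Setosa, Versicolour, and Virginica.
predicted = np.where(length_bin == width_bin, length_bin, -1)
covered = predicted != -1
accuracy = np.mean(predicted[covered] == iris.target[covered])
print("flowers covered:", covered.sum(), "accuracy on covered:", round(accuracy, 3))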
Association analysis is used to discover patterns that describe strongly associated features in the
data. The discovered patterns are typically represented in the form of implication rules or feature
subsets. Because of the exponential size of its search space, the goal of association analysis is to
extract the most interesting patterns in an efficient manner. Useful applications of association
analysis include finding groups of genes that have related functionality, identifying web pages that
are accessed together, or understanding the relationships between different elements of Earth’s
climate system.
Example 1.2 (Market Basket Analysis). The transactions shown in Table 1.1 illustrate point-of-
sale data collected at the checkout counters of a grocery store. Association analysis can be applied
to find items that are frequently bought together by customers. For example, we may discover the
rule {Diapers} → {Milk}, which suggests that customers who buy diapers also tend to buy milk.
This type of rule can be used to identify potential cross-selling opportunities among related items.
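A minimal sketch of this computation is shown below: the support and confidence of the rule {Diapers} → {Milk} are computed over a handful of illustrative transactions (the actual contents of Table 1.1 are not reproduced here).

# Compute support and confidence for {Diapers} -> {Milk} on toy transactions.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]

both = sum(1 for t in transactions if {"Diapers", "Milk"} <= t)
diapers = sum(1 for t in transactions if "Diapers" in t)

support = both / len(transactions)   # fraction of all transactions with both items
confidence = both / diapers          # fraction of diaper transactions that include milk
print("support:", support, "confidence:", confidence)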
Cluster analysis seeks to find groups of closely related observations so that observations that
belong to the same cluster are more similar to each other than observations that belong to other
clusters. Clustering has been used to group sets of related customers, find areas of the ocean that
have a significant impact on the Earth’s climate, and compress data.
Example 1.3 (Document Clustering). The collection of news articles shown in Table 1.2 can be
grouped based on their respective topics. Each article is represented as a set of word-frequency
pairs (w : c), where w is a word and c is the number of times the word appears in the article. There
are two natural clusters in the data set. The first cluster consists of the first four articles, which
correspond to news about the economy, while the second cluster contains the last four articles,
which correspond to news about health care. A good clustering algorithm should be able to identify
these two clusters based on the similarity between words that appear in the articles.
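The scikit-learn sketch below performs a comparable clustering: short placeholder texts (not the actual articles of Table 1.2) are converted to word-weight vectors and grouped into two clusters with k-means.

# Document clustering sketch: vectorize short placeholder texts and cluster them.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

articles = [
    "dollar markets economy trade deficit",
    "economy jobs inflation interest rates",
    "stock markets economy growth forecast",
    "trade economy exports markets",
    "patients hospital health care treatment",
    "disease vaccine health patients doctors",
    "health care costs insurance patients",
    "hospital doctors treatment disease care",
]

vectors = TfidfVectorizer().fit_transform(articles)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(labels)   # articles sharing a label fall in the same cluster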
Anomaly detection is the task of identifying observations whose characteristics are significantly
different from the rest of the data. Such observations are known as anomalies or outliers. The
goal of an anomaly detection algorithm is to discover the real anomalies and avoid falsely labeling
normal objects as anomalous. In other words, a good anomaly detector must have a high detection
rate and a low false alarm rate. Applications of anomaly detection include the detection of fraud,
network intrusions, unusual patterns of disease, and ecosystem disturbances, such as droughts,
floods, fires, hurricanes, etc.
Example 1.4 (Credit Card Fraud Detection). A credit card company records the transactions
made by every credit card holder, along with personal information such as credit limit, age, annual
income, and address. Since the number of fraudulent cases is relatively small compared to the
number of legitimate transactions, anomaly detection techniques can be applied to build a profile
of legitimate transactions for the users. When a new transaction arrives, it is compared against the
profile of the user. If the characteristics of the transaction are very different from the previously
created profile, then the transaction is flagged as potentially fraudulent.
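The sketch below captures the profile idea in its simplest form: a user's past transaction amounts define a profile, and a new transaction is flagged when it deviates from that profile by more than a chosen threshold. All numbers and the threshold are made up; a real system would model many more attributes of each transaction.

# Sketch of the profile idea: flag a new transaction whose amount deviates
# strongly from a user's history (all numbers are illustrative).
import numpy as np

past_amounts = np.array([23.0, 41.5, 18.2, 36.9, 29.4, 44.1, 31.8, 27.3])
mean, std = past_amounts.mean(), past_amounts.std()

new_amount = 940.0
z_score = abs(new_amount - mean) / std
if z_score > 3:                      # threshold chosen for illustration
    print("transaction flagged as potentially fraudulent")
else:
    print("transaction looks consistent with the user's profile")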