Chapter 1 - What is Data Mining

What Is Data Mining?

Data mining is the process of automatically discovering useful information in large data
repositories. Data mining techniques are deployed to scour large data sets in order to find novel
and useful patterns that might otherwise remain unknown. They also provide the capability to
predict the outcome of a future observation, such as the amount a customer will spend at an
online or a brick-and-mortar store.
Not all information discovery tasks are considered to be data mining. Examples include queries,
e.g., looking up individual records in a database or finding web pages that contain a particular set
of keywords. This is because such tasks can be accomplished through simple interactions with a
database management system or an information retrieval system. These systems rely on traditional
computer science techniques, which include sophisticated indexing structures and query
processing algorithms, for efficiently organizing and retrieving information from large data
repositories. Nonetheless, data mining techniques have been used to enhance the performance of
such systems by improving the quality of the search results based on their relevance to the input
queries.
Data Mining and Knowledge Discovery in Databases
Data mining is an integral part of knowledge discovery in databases (KDD), which is the overall
process of converting raw data into useful information, as shown in Figure 1.1. This process
consists of a series of steps, from data preprocessing to postprocessing of data mining results.

The input data can be stored in a variety of formats (flat files, spreadsheets, or relational tables)
and may reside in a centralized data repository or be distributed across multiple sites. The purpose
of preprocessing is to transform the raw input data into an appropriate format for subsequent
analysis. The steps involved in data preprocessing include fusing data from multiple sources,
cleaning data to remove noise and duplicate observations, and selecting records and features that
are relevant to the data mining task at hand. Because of the many ways data can be collected and
stored, data preprocessing is perhaps the most laborious and time-consuming step in the overall
knowledge discovery process.
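As a rough illustration of these preprocessing steps, the following sketch uses the pandas library with hypothetical file and column names; it is not part of the original text, but it shows what fusing multiple sources, removing duplicates and noise, and selecting relevant features might look like in practice.

```python
# Minimal preprocessing sketch using pandas (file and column names are hypothetical).
import pandas as pd

# Fuse data from two sources on a shared customer identifier.
orders = pd.read_csv("orders.csv")        # e.g., columns: customer_id, amount, date
profiles = pd.read_csv("profiles.csv")    # e.g., columns: customer_id, age, region
data = orders.merge(profiles, on="customer_id", how="inner")

# Clean the data: drop exact duplicate rows and records with missing values.
data = data.drop_duplicates()
data = data.dropna(subset=["amount", "age"])

# Remove obvious noise, e.g., negative purchase amounts.
data = data[data["amount"] >= 0]

# Select only the features relevant to the mining task at hand.
features = data[["customer_id", "amount", "age", "region"]]
```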
“Closing the loop” is a phrase often used to refer to the process of integrating data mining results
into decision support systems. For example, in business applications, the insights offered by data
mining results can be integrated with campaign management tools so that effective marketing
promotions can be conducted and tested. Such integration requires a postprocessing step to ensure
that only valid and useful results are incorporated into the decision support system. An example
of postprocessing is visualization, which allows analysts to explore the data and the data mining
results from a variety of viewpoints. Hypothesis testing methods can also be applied during
postprocessing to eliminate spurious data mining results.

Motivating Challenges
As mentioned earlier, traditional data analysis techniques have often encountered practical
difficulties in meeting the challenges posed by big data applications. The following are some of
the specific challenges that motivated the development of data mining.
Scalability Because of advances in data generation and collection, data sets with sizes of terabytes,
petabytes, or even exabytes are becoming common. If data mining algorithms are to handle these
massive data sets, they must be scalable. Many data mining algorithms employ special search
strategies to handle exponential search problems. Scalability may also require the implementation
of novel data structures to access individual records in an efficient manner. For instance, out-of-core
algorithms may be necessary when processing data sets that cannot fit into main memory.
Scalability can also be improved by using sampling or developing parallel and distributed
algorithms.
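As a minimal illustration of the out-of-core idea, the following sketch streams a large file in fixed-size chunks using pandas, so that only one chunk is held in main memory at a time; the file name and the aggregation computed are hypothetical, and real systems would typically combine this with sampling or parallel and distributed processing.

```python
# Out-of-core style processing sketch: read a large CSV in chunks instead of
# loading it into main memory all at once (file name is hypothetical).
import pandas as pd

total, count = 0.0, 0
for chunk in pd.read_csv("transactions.csv", chunksize=1_000_000):
    # Each chunk is an ordinary DataFrame small enough to fit in memory.
    total += chunk["amount"].sum()
    count += len(chunk)

print("mean transaction amount:", total / count)
```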
High Dimensionality It is now common to encounter data sets with hundreds or thousands of
attributes instead of the handful that was common a few decades ago. In bioinformatics, progress in
microarray technology has produced gene expression data involving thousands of features. Data
sets with temporal or spatial components also tend to have high dimensionality. For example,
consider a data set that contains measurements of temperature at various locations. If the
temperature measurements are taken repeatedly for an extended period, the number of dimensions
(features) increases in proportion to the number of measurements taken. Traditional data analysis
techniques that were developed for low-dimensional data often do not work well for such high-
dimensional data due to issues such as the curse of dimensionality (to be discussed in Chapter 2).
Also, for some data analysis algorithms, the computational complexity increases rapidly as the
dimensionality (the number of features) increases.
Heterogeneous and Complex Data Traditional data analysis methods often deal with data sets
containing attributes of the same type, either continuous or categorical. As the role of data mining
in business, science, medicine, and other fields has grown, so has the need for techniques that can
handle heterogeneous attributes. Recent years have also seen the emergence of more complex data
objects. Examples of such non-traditional types of data include web and social media data
containing text, hyperlinks, images, audio, and videos; DNA data with sequential and three-
dimensional structure; and climate data that consists of measurements (temperature, pressure, etc.)
at various times and locations on the Earth’s surface. Techniques developed for
mining such complex objects should take into consideration relationships in the data, such as
temporal and spatial autocorrelation, graph connectivity, and parent-child relationships between
the elements in semi-structured text and XML documents.
Data Ownership and Distribution Sometimes, the data needed for an analysis is not stored in
one location or owned by one organization. Instead, the data is geographically distributed among
resources belonging to multiple entities. This requires the development of distributed data mining
techniques. The key challenges faced by distributed data mining algorithms include the following:
(1) how to reduce the amount of communication needed to perform the distributed computation,
(2) how to effectively consolidate the data mining results obtained from multiple sources, and (3)
how to address data security and privacy issues.
Non-traditional Analysis The traditional statistical approach is based on a hypothesize-and-test
paradigm. In other words, a hypothesis is proposed, an experiment is designed to gather the data,
and then the data is analyzed with respect to the hypothesis. Unfortunately, this process is
extremely labor-intensive. Current data analysis tasks often require the generation and evaluation
of thousands of hypotheses, and consequently, the development of some data mining techniques
has been motivated by the desire to automate the process of hypothesis generation and evaluation.
Furthermore, the data sets analyzed in data mining are typically not the result of a carefully
designed experiment and often represent opportunistic samples of the data, rather than random
samples.

The Origins of Data Mining


While data mining has traditionally been viewed as an intermediate process within the KDD
framework, as shown in Figure 1.1, it has emerged over the years as an academic field within
computer science, focusing on all aspects of KDD, including data preprocessing, mining, and
postprocessing. Its origin can be traced back to the late 1980s, following a series of workshops
organized on the topic of knowledge discovery in databases. The workshops brought together
researchers from different disciplines to discuss the challenges and opportunities in applying
computational techniques to extract actionable knowledge from large databases. The workshops
quickly grew into hugely popular conferences that were attended by researchers and practitioners
from both academia and industry. The success of these conferences, along with the interest
shown by businesses and industry in recruiting new hires with a data mining background, has
fueled the tremendous growth of this field.
The field was initially built upon the methodology and algorithms that researchers had previously
used. In particular, data mining researchers draw upon ideas, such as (1) sampling, estimation, and
hypothesis testing from statistics and (2) search algorithms, modeling techniques, and learning
theories from artificial intelligence, pattern recognition, and machine learning. Data mining has
also been quick to adopt ideas from other areas, including optimization, evolutionary computing,
information theory, signal processing, visualization, and information retrieval, and to extend them
to solve the challenges of mining big data.
A number of other areas also play key supporting roles. In particular, database systems are needed
to provide support for efficient storage, indexing, and query processing. Techniques from high
performance (parallel) computing are often important in addressing the massive size of some data
sets. Distributed techniques can also help address the issue of size and are essential when the data
cannot be gathered in one location. Figure 1.2 shows the relationship of data mining to other areas.

Data Science and Data-Driven Discovery


Data science is an interdisciplinary field that studies and applies tools and techniques for deriving
useful insights from data. Although data science is regarded as an emerging field with a distinct
identity of its own, the tools and techniques often come from many different areas of data analysis,
such as data mining, statistics, AI, machine learning, pattern recognition, database technology, and
distributed and parallel computing. (See Figure 1.2.)
The emergence of data science as a new field is a recognition that, often, none of the existing areas
of data analysis provides a complete set of tools for the data analysis tasks that are
encountered in emerging applications. Instead, a broad range of computational, mathematical, and
statistical skills is often required. To illustrate the challenges that arise in analyzing such data,
consider the following example. Social media and the Web present new opportunities for social
scientists to observe and quantitatively measure human behavior on a large scale. To conduct such
a study, social scientists work with analysts who possess skills in areas such as web mining, natural
language processing (NLP), network analysis, data mining, and statistics. Compared to more
traditional research in social science, which is often based on surveys, this analysis requires a
broader range of skills and tools, and involves far larger amounts of data. Thus, data science is, by
necessity, a highly interdisciplinary field that builds on the continuing work of many fields.
The data-driven approach of data science emphasizes the direct discovery of patterns and
relationships from data, especially in large quantities of data, often without the need for extensive
domain knowledge. A notable example of the success of this approach is represented by advances
in neural networks, i.e., deep learning, which have been particularly successful in areas that have
long proved challenging, e.g., recognizing objects in photos or videos and words in speech, as well
as in other application areas. However, note that this is just one example of the success of data-
driven approaches, and dramatic improvements have also occurred in many other areas of data
analysis. Many of these developments are topics described later in this book.

Data Mining Tasks


Data mining tasks are generally divided into two major categories:
Predictive tasks The objective of these tasks is to predict the value of a particular attribute based
on the values of other attributes. The attribute to be predicted is commonly known as the target or
dependent variable, while the attributes used for making the prediction are known as the
explanatory or independent variables.
Descriptive tasks Here, the objective is to derive patterns (correlations, trends, clusters,
trajectories, and anomalies) that summarize the underlying relationships in data. Descriptive data
mining tasks are often exploratory in nature and frequently require postprocessing techniques to
validate and explain the results. Figure 1.3 illustrates four of the core data mining tasks that are
described in the remainder of this book.
Predictive modeling refers to the task of building a model for the target variable as a function of
the explanatory variables. There are two types of predictive modeling tasks: classification, which
is used for discrete target variables, and regression, which is used for continuous target variables.
For example, predicting whether a web user will make a purchase at an online bookstore is a
classification task because the target variable is binary-valued.

On the other hand, forecasting the future price of a stock is a regression task because price is a
continuous-valued attribute. The goal of both tasks is to learn a model that minimizes the error
between the predicted and true values of the target variable. Predictive modeling can be used to
identify customers who will respond to a marketing campaign, predict disturbances in the Earth’s
ecosystem, or judge whether a patient has a particular disease based on the results of medical tests.
Example 1.1 (Predicting the Type of a Flower). Consider the task of predicting a species of
flower based on the characteristics of the flower. In particular, consider classifying an Iris flower
as one of the following three Iris species: Setosa, Versicolour, or Virginica. To perform this task,
we need a data set containing the characteristics of various flowers of these three species. A data
set with this type of information is the well-known Iris data set from the UCI Machine Learning
Repository at http://www.ics.uci.edu/~mlearn. In addition to the species of a flower, this data set
contains four other attributes: sepal width, sepal length, petal length, and petal width. Figure 1.4
shows a plot of petal width versus petal length for the 150 flowers in the Iris data set. Petal width
is broken into the categories low, medium, and high, which correspond to the intervals [0, 0.75),
[0.75, 1.75), [1.75, ∞), respectively. Also, petal length is broken into categories low, medium, and
high, which correspond to the intervals [0, 2.5), [2.5, 5), [5, ∞), respectively. Based on these
categories of petal width and length, the following rules can be derived:
• Petal width low and petal length low implies Setosa.
• Petal width medium and petal length medium implies Versicolour.
• Petal width high and petal length high implies Virginica.
While these rules do not classify all the flowers, they do a good (but not perfect) job of classifying
most of the flowers. Note that flowers from the Setosa species are well separated from the
Versicolour and Virginica species with respect to petal width and length, but the latter two species
overlap somewhat with respect to these attributes.
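For illustration, the following sketch applies these three rules to the Iris data, here loaded from scikit-learn's bundled copy of the UCI data set; the thresholds are the category boundaries given above, and the small helper function is hypothetical rather than part of the original example.

```python
# Apply the interval-based rules from Example 1.1 to the Iris data
# (loaded here from scikit-learn's bundled copy of the UCI data set).
from sklearn.datasets import load_iris

iris = load_iris()
petal_length = iris.data[:, 2]   # cm
petal_width = iris.data[:, 3]    # cm

def classify(pl, pw):
    # Thresholds follow the low/medium/high categories defined in the text.
    if pw < 0.75 and pl < 2.5:
        return "setosa"
    if 0.75 <= pw < 1.75 and 2.5 <= pl < 5.0:
        return "versicolor"
    if pw >= 1.75 and pl >= 5.0:
        return "virginica"
    return "unclassified"

predictions = [classify(pl, pw) for pl, pw in zip(petal_length, petal_width)]
correct = sum(p == iris.target_names[t] for p, t in zip(predictions, iris.target))
print(f"{correct} of {len(predictions)} flowers classified correctly")
```

Running the sketch confirms the point made above: the rules classify most, but not all, of the 150 flowers correctly, with the errors coming from the overlap between Versicolour and Virginica.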
Association analysis is used to discover patterns that describe strongly associated features in the
data. The discovered patterns are typically represented in the form of implication rules or feature
subsets. Because of the exponential size of its search space, the goal of association analysis is to
extract the most interesting patterns in an efficient manner. Useful applications of association
analysis include finding groups of genes that have related functionality, identifying web pages that
are accessed together, or understanding the relationships between different elements of Earth’s
climate system.
Example 1.2 (Market Basket Analysis). The transactions shown in Table 1.1 illustrate point-of-
sale data collected at the checkout counters of a grocery store. Association analysis can be applied
to find items that are frequently bought together by customers. For example, we may discover the
rule {Diapers} → {Milk}, which suggests that customers who buy diapers also tend to buy milk.
This type of rule can be used to identify potential cross-selling opportunities among related items.
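Since Table 1.1 is not reproduced here, the following sketch computes the support and confidence of the rule {Diapers} → {Milk} over a small hypothetical transaction list; real association analysis algorithms search the space of such rules efficiently rather than checking each rule by hand.

```python
# Support and confidence for the rule {Diapers} -> {Milk} over a small
# hypothetical transaction list (Table 1.1 itself is not reproduced here).
transactions = [
    {"Bread", "Butter"},
    {"Diapers", "Milk", "Beer"},
    {"Diapers", "Milk", "Bread"},
    {"Diapers", "Beer"},
    {"Milk", "Bread", "Butter"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"Diapers", "Milk"} <= t)
diapers = sum(1 for t in transactions if "Diapers" in t)

support = both / n              # fraction of all transactions containing both items
confidence = both / diapers     # fraction of diaper transactions that also contain milk
print(f"support={support:.2f}, confidence={confidence:.2f}")
```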

Cluster analysis seeks to find groups of closely related observations so that observations that
belong to the same cluster are more similar to each other than observations that belong to other
clusters. Clustering has been used to group sets of related customers, find areas of the ocean that
have a significant impact on the Earth’s climate, and compress data.
Example 1.3 (Document Clustering). The collection of news articles shown in Table 1.2 can be
grouped based on their respective topics. Each article is represented as a set of word-frequency
pairs (w : c), where w is a word and c is the number of times the word appears in the article. There
are two natural clusters in the data set. The first cluster consists of the first four articles, which
correspond to news about the economy, while the second cluster contains the last four articles,
which correspond to news about health care. A good clustering algorithm should be able to identify
these two clusters based on the similarity between words that appear in the articles.
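As an illustration of this idea, the following sketch clusters a few hypothetical article snippets (standing in for Table 1.2, which is not reproduced here) using TF-IDF word weights and k-means from scikit-learn; the choice of two clusters mirrors the economy and health care topics in the example.

```python
# Group a few hypothetical article snippets into two topical clusters
# using TF-IDF features and k-means (Table 1.2 is not reproduced here).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

articles = [
    "dollar rises as markets rally and the economy grows",
    "central bank cuts interest rates to boost the economy",
    "stocks fall amid fears of slowing economic growth",
    "trade deficit narrows as exports increase",
    "new vaccine reduces risk of infection in clinical trial",
    "hospitals expand care for patients with chronic disease",
    "study links diet and exercise to lower heart disease",
    "doctors report rise in seasonal flu cases",
]

X = TfidfVectorizer(stop_words="english").fit_transform(articles)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # articles with the same label were assigned to the same cluster
```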
Anomaly detection is the task of identifying observations whose characteristics are significantly
different from the rest of the data. Such observations are known as anomalies or outliers. The
goal of an anomaly detection algorithm is to discover the real anomalies and avoid falsely labeling
normal objects as anomalous. In other words, a good anomaly detector must have a high detection
rate and a low false alarm rate. Applications of anomaly detection include the detection of fraud,
network intrusions, unusual patterns of disease, and ecosystem disturbances, such as droughts,
floods, fires, hurricanes, etc.
Example 1.4 (Credit Card Fraud Detection). A credit card company records the transactions
made by every credit card holder, along with personal information such as credit limit, age, annual
income, and address. Since the number of fraudulent cases is relatively small compared to the
number of legitimate transactions, anomaly detection techniques can be applied to build a profile
of legitimate transactions for the users. When a new transaction arrives, it is compared against the
profile of the user. If the characteristics of the transaction are very different from the previously
created profile, then the transaction is flagged as potentially fraudulent.
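The following sketch illustrates this profile-based idea in a minimal form, using a simple z-score test over a hypothetical user's past transaction amounts; an actual system would build a far richer profile from many attributes, but the flagging logic follows the same comparison against previously observed behavior.

```python
# Profile-based anomaly check sketch: flag a new transaction whose amount
# deviates strongly from a user's history (data and threshold are hypothetical).
import statistics

history = [23.50, 41.00, 18.75, 37.20, 29.99, 45.10, 22.30]  # past legitimate amounts
mean = statistics.mean(history)
stdev = statistics.stdev(history)

def is_suspicious(amount, threshold=3.0):
    # A simple z-score test: how many standard deviations from the user's profile?
    z = abs(amount - mean) / stdev
    return z > threshold

print(is_suspicious(35.00))   # False: close to the user's usual spending
print(is_suspicious(900.00))  # True: far outside the previously created profile
```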
