04 Data Mining-Applications
Invisible data mining: We cannot expect everyone in society to learn and master
data mining techniques. More and more systems should have data mining functions built within so that people can perform data mining or use data mining results
simply by mouse clicking, without any knowledge of data mining algorithms. Intelligent search engines and Internet-based stores perform such invisible data mining by
incorporating data mining into their components to improve their functionality and
performance. This is often done without the user's knowledge. For example, when purchasing items online, users may be unaware that the store is likely collecting data on
the buying patterns of its customers, which may be used to recommend other items
for purchase in the future.
These issues and many additional ones relating to the research, development, and
application of data mining are discussed throughout the book.
1.8 Summary
Necessity is the mother of invention. With the mounting growth of data in every application, data mining meets the imminent need for effective, scalable, and flexible data
analysis in our society. Data mining can be considered as a natural evolution of information technology and a confluence of several related disciplines and application
domains.
Data mining is the process of discovering interesting patterns from massive amounts
of data. As a knowledge discovery process, it typically involves data cleaning, data integration, data selection, data transformation, pattern discovery, pattern evaluation,
and knowledge presentation.
A pattern is interesting if it is valid on test data with some degree of certainty, novel,
potentially useful (e.g., can be acted on or validates a hunch about which the user was
curious), and easily understood by humans. Interesting patterns represent knowledge. Measures of pattern interestingness, either objective or subjective, can be used
to guide the discovery process.
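As one illustration of an objective interestingness measure, the sketch below computes the support and confidence of an association rule. These two measures are chosen here as common examples, not as the only ones the text has in mind. The transactions and the function names are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) over the transactions."""
    both = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    return both / base if base else 0.0


# Hypothetical toy transactions, not data from the text.
transactions = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer"},
]

# Rule under evaluation: computer => software
rule_support = support(transactions, {"computer", "software"})
rule_confidence = confidence(transactions, {"computer"}, {"software"})
print(rule_support, rule_confidence)
```

A rule that falls below user-chosen thresholds on such measures could be discarded as uninteresting. Subjective measures would additionally need a model of what the user already believes or expects.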
We present a multidimensional view of data mining. The major dimensions are
data, knowledge, technologies, and applications.
Data mining can be conducted on any kind of data as long as the data are meaningful
for a target application, such as database data, data warehouse data, transactional
data, and advanced data types. Advanced data types include time-related or sequence
data, data streams, spatial and spatiotemporal data, text and multimedia data, graph
and networked data, and Web data.
A data warehouse is a repository for long-term storage of data from multiple sources,
organized so as to facilitate management decision making. The data are stored
under a unified schema and are typically summarized. Data warehouse systems provide multidimensional data analysis capabilities, collectively referred to as online
analytical processing.
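The idea of a unified schema with summarized data can be sketched in a few lines. The example below assumes two hypothetical sources that record sales under different field names and units, maps both onto one schema, and then summarizes along chosen dimensions. All names and figures are illustrative assumptions, not part of any real warehouse system.

```python
from collections import defaultdict

# Two hypothetical sources with different field names and units.
store_sales = [
    {"branch": "Vancouver", "qtr": "Q1", "amount_usd": 1200.0},
    {"branch": "Vancouver", "qtr": "Q2", "amount_usd": 800.0},
]
web_sales = [
    {"region": "Toronto", "quarter": "Q1", "amount_cents": 50000},
    {"region": "Vancouver", "quarter": "Q1", "amount_cents": 30000},
]


def unify(store_rows, web_rows):
    """Map both sources onto one schema: (location, quarter, dollars)."""
    unified = []
    for r in store_rows:
        unified.append((r["branch"], r["qtr"], r["amount_usd"]))
    for r in web_rows:
        unified.append((r["region"], r["quarter"], r["amount_cents"] / 100))
    return unified


def summarize(rows, dims):
    """Sum dollars grouped by dimension positions (0 = location, 1 = quarter)."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[d] for d in dims)
        totals[key] += row[2]
    return dict(totals)


warehouse = unify(store_sales, web_sales)
by_location_quarter = summarize(warehouse, dims=(0, 1))
by_quarter = summarize(warehouse, dims=(1,))  # coarser view: quarter only
print(by_location_quarter)
print(by_quarter)
```

Moving from `by_location_quarter` to `by_quarter` is a small instance of viewing the same data at a coarser level, which is the kind of operation online analytical processing tools offer interactively.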
the major topics in a collection of documents and, for each document in the collection,
the major topics involved.
Increasingly large amounts of text and multimedia data have been accumulated and
made available online due to the fast growth of the Web and applications such as digital libraries, digital governments, and health care information systems. Their effective
search and analysis have raised many challenging issues in data mining. Therefore, text
mining and multimedia data mining, integrated with information retrieval methods,
have become increasingly important.
1.6
1.6.1 Business Intelligence
It is critical for businesses to acquire a better understanding of the commercial context
of their organization, such as their customers, the market, supply and resources, and
competitors. Business intelligence (BI) technologies provide historical, current, and
predictive views of business operations. Examples include reporting, online analytical
processing, business performance management, competitive intelligence, benchmarking, and predictive analytics.
How important is business intelligence? Without data mining, many businesses may
not be able to perform effective market analysis, compare customer feedback on similar products, discover the strengths and weaknesses of their competitors, retain highly
valuable customers, or make smart business decisions.
Clearly, data mining is the core of business intelligence. Online analytical processing tools in business intelligence rely on data warehousing and multidimensional data
mining. Classification and prediction techniques are the core of predictive analytics
in business intelligence, for which there are many applications in analyzing markets,
supplies, and sales. Moreover, clustering plays a central role in customer relationship
management, which groups customers based on their similarities. Using characterization mining techniques, we can better understand features of each customer group and
develop customized customer reward programs.
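The grouping-then-characterizing idea above can be sketched as follows. The text does not name a clustering algorithm, so plain k-means is used here as an assumption, with the first k points as initial centroids to keep the run deterministic. The customer features (yearly spend in hundreds of dollars, visits per month) are hypothetical, and a real application would normally scale the features first.

```python
from statistics import mean


def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centroids[c])),
    )


def kmeans(points, k, iterations=10):
    """Plain k-means; the initial centroids are simply the first k points."""
    centroids = [points[i] for i in range(k)]
    assignment = []
    for _ in range(iterations):
        assignment = [nearest(p, centroids) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(mean(dim) for dim in zip(*members))
    return assignment, centroids


def characterize(points, assignment, k):
    """Describe each cluster by its size and per-feature averages."""
    profile = {}
    for c in range(k):
        members = [p for p, a in zip(points, assignment) if a == c]
        profile[c] = {
            "size": len(members),
            "avg": tuple(mean(dim) for dim in zip(*members)),
        }
    return profile


# Hypothetical customers: (yearly spend in hundreds, visits per month).
customers = [(2, 1), (50, 12), (3, 2), (48, 10), (1, 1), (52, 11)]
assignment, centroids = kmeans(customers, k=2)
profile = characterize(customers, assignment, k=2)
print(assignment)
print(profile)
```

The per-cluster averages in `profile` play the role of a simple characterization: a group with high spend and frequent visits might be a candidate for a different reward program than a group of occasional buyers.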
Chapter 1 Introduction
1.6.2
A Web crawler is a computer program that browses the Web in a methodical, automated manner.
1.7
1.7.1 Mining Methodology
Researchers have been vigorously developing new data mining methodologies. This
involves the investigation of new kinds of knowledge, mining in multidimensional
space, integrating methods from other disciplines, and the consideration of semantic ties
among data objects. In addition, mining methodologies should consider issues such as
data uncertainty, noise, and incompleteness. Some mining methods explore how user-specified measures can be used to assess the interestingness of discovered patterns as
well as guide the discovery process. Let's have a look at these various aspects of mining
methodology.
Mining various and new kinds of knowledge: Data mining covers a wide spectrum of
data analysis and knowledge discovery tasks, from data characterization and discrimination to association and correlation analysis, classification, regression, clustering,
outlier analysis, sequence analysis, and trend and evolution analysis. These tasks may
use the same database in different ways and require the development of numerous
data mining techniques. Due to the diversity of applications, new mining tasks continue to emerge, making data mining a dynamic and fast-growing field. For example,
for effective knowledge discovery in information networks, integrated clustering and
ranking may lead to the discovery of high-quality clusters and object ranks in large
networks.
Mining knowledge in multidimensional space: When searching for knowledge in large
data sets, we can explore the data in multidimensional space. That is, we can search
for interesting patterns among combinations of dimensions (attributes) at varying
levels of abstraction. Such mining is known as (exploratory) multidimensional data
mining. In many cases, data can be aggregated or viewed as a multidimensional data
cube. Mining knowledge in cube space can substantially enhance the power and
flexibility of data mining.
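To make the notion of cube space concrete, the sketch below aggregates a measure over every combination of dimensions, so that each combination yields one group-by (often called a cuboid). It covers only combinations of dimensions, not the varying levels of abstraction within a dimension, and it materializes everything naively. The rows, dimension names, and function name are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def data_cube(rows, dims, measure):
    """Sum `measure` for every subset of `dims`, one group-by per subset."""
    cube = {}
    for size in range(len(dims) + 1):
        for group in combinations(dims, size):
            totals = defaultdict(int)
            for row in rows:
                totals[tuple(row[d] for d in group)] += row[measure]
            cube[group] = dict(totals)
    return cube


# Hypothetical sales facts.
rows = [
    {"city": "Vancouver", "item": "computer", "year": 2010, "sales": 5},
    {"city": "Vancouver", "item": "printer", "year": 2010, "sales": 2},
    {"city": "Toronto", "item": "computer", "year": 2011, "sales": 4},
]

cube = data_cube(rows, dims=("city", "item", "year"), measure="sales")
print(len(cube))           # number of group-bys for 3 dimensions
print(cube[()])            # grand total
print(cube[("city",)])     # totals per city
print(cube[("city", "item")])
```

With n dimensions there are 2^n such group-bys, so full materialization grows quickly as dimensions are added.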
Data mining as an interdisciplinary effort: The power of data mining can be substantially enhanced by integrating new methods from multiple disciplines. For example,
to mine data with natural language text, it makes sense to fuse data mining methods
with methods of information retrieval and natural language processing. As another
example, consider the mining of software bugs in large programs. This form of mining, known as bug mining, benefits from the incorporation of software engineering
knowledge into the data mining process.
Boosting the power of discovery in a networked environment: Most data objects reside
in a linked or interconnected environment, whether it be the Web, database relations, files, or documents. Semantic links across multiple data objects can be used
to advantage in data mining. Knowledge derived in one set of objects can be used
to boost the discovery of knowledge in a related or semantically linked set of
objects.
Handling uncertainty, noise, or incompleteness of data: Data often contain noise,
errors, exceptions, or uncertainty, or are incomplete. Errors and noise may confuse
the data mining process, leading to the derivation of erroneous patterns. Data cleaning, data preprocessing, outlier detection and removal, and uncertainty reasoning are
examples of techniques that need to be integrated with the data mining process.
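A minimal sketch of such preprocessing follows. It makes two assumptions that the text does not prescribe: missing values (represented as `None`) are filled with the median of the observed values, and outliers are flagged with a median-absolute-deviation (MAD) rule. The readings and the threshold `k` are hypothetical.

```python
from statistics import median


def clean(values, k=3):
    """Fill missing values with the median, then drop MAD-based outliers.

    A value is treated as an outlier when its distance from the median
    exceeds k times the median absolute deviation (an assumed rule).
    """
    observed = [v for v in values if v is not None]
    fill = median(observed)
    filled = [fill if v is None else v for v in values]

    center = median(filled)
    mad = median(abs(v - center) for v in filled)
    if mad == 0:
        return filled  # no spread to judge outliers against
    return [v for v in filled if abs(v - center) <= k * mad]


# Hypothetical sensor readings with one missing value and one gross error.
readings = [10, 12, None, 11, 13, 500, 12]
cleaned = clean(readings)
print(cleaned)
```

Whether a removed value is noise or a genuinely interesting exception depends on the application, so a step like this is usually tuned together with the mining task rather than applied blindly.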
Pattern evaluation and pattern- or constraint-guided mining: Not all the patterns generated by data mining processes are interesting. What makes a pattern interesting
may vary from user to user. Therefore, techniques are needed to assess the interestingness of discovered patterns based on subjective measures. These estimate the
value of patterns with respect to a given user class, based on user beliefs or expectations. Moreover, by using interestingness measures or user-specified constraints to
guide the discovery process, we may generate more interesting patterns and reduce
the search space.
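The effect of a user-specified constraint on the search space can be sketched with a small level-wise (Apriori-style) frequent itemset search. This is an illustrative choice of technique, and the constraint is assumed to be anti-monotone: once an itemset violates it, every superset does too, so the itemset can be pruned before it is counted. The items, prices, and budget are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_count, allowed=lambda items: True):
    """Return (itemset -> count, number of candidates actually counted)."""
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items]
    result, counted = {}, 0
    while level:
        size = len(level[0]) + 1
        survivors = []
        for cand in level:
            if not allowed(cand):
                continue  # pruned by the user constraint, never counted
            counted += 1
            count = sum(1 for t in transactions if cand <= t)
            if count >= min_count:
                result[cand] = count
                survivors.append(cand)
        # Join survivors into candidates that are one item larger.
        joined = {a | b for a, b in combinations(survivors, 2) if len(a | b) == size}
        # Keep a candidate only if all its smaller subsets survived.
        kept = set(survivors)
        level = sorted(
            (c for c in joined if all(c - {i} in kept for i in c)), key=sorted
        )
    return result, counted


# Hypothetical prices and transactions.
prices = {"laptop": 900, "mouse": 20, "bag": 40, "dock": 150}
transactions = [
    {"laptop", "mouse", "bag"},
    {"laptop", "mouse"},
    {"mouse", "bag"},
    {"laptop", "mouse", "bag", "dock"},
]

all_sets, all_counted = frequent_itemsets(transactions, min_count=2)
cheap_sets, cheap_counted = frequent_itemsets(
    transactions,
    min_count=2,
    allowed=lambda s: sum(prices[i] for i in s) <= 100,  # user budget
)
print(len(all_sets), all_counted)
print(len(cheap_sets), cheap_counted)
```

Comparing `all_counted` with `cheap_counted` shows how many candidate itemsets the constraint removes from the search before any counting takes place.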
1.7.2 User Interaction
The user plays an important role in the data mining process. Interesting areas of research
include how to interact with a data mining system, how to incorporate a user's background knowledge in mining, and how to visualize and comprehend data mining results.
We introduce each of these here.
Interactive mining: The data mining process should be highly interactive. Thus, it is
important to build flexible user interfaces and an exploratory mining environment,
facilitating the user's interaction with the system. A user may like to first sample a
set of data, explore general characteristics of the data, and estimate potential mining results. Interactive mining should allow users to dynamically change the focus
of a search, to refine mining requests based on returned results, and to drill, dice,
and pivot through the data and knowledge space interactively, dynamically exploring
cube space while mining.
Incorporation of background knowledge: Background knowledge, constraints, rules,
and other information regarding the domain under study should be incorporated
into the knowledge discovery process. Such knowledge can be used for pattern
evaluation as well as to guide the search toward interesting patterns.
Ad hoc data mining and data mining query languages: Query languages (e.g., SQL)
have played an important role in flexible searching because they allow users to pose
ad hoc queries. Similarly, high-level data mining query languages or other high-level
flexible user interfaces will give users the freedom to define ad hoc data mining tasks.
This should facilitate specification of the relevant sets of data for analysis, the domain
knowledge, the kinds of knowledge to be mined, and the conditions and constraints
to be enforced on the discovered patterns. Optimization of the processing of such
flexible mining requests is another promising area of study.
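As a rough illustration of an ad hoc mining request, the sketch below uses plain SQL through Python's `sqlite3` module rather than a dedicated data mining query language. The `WHERE` clause selects the relevant data, the self-join defines the kind of pattern sought (pairs of items bought together), and the `HAVING` clause enforces a minimum-count constraint. The table, columns, and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (trans_id INTEGER, item TEXT, branch TEXT)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [
        (1, "computer", "Vancouver"), (1, "software", "Vancouver"),
        (2, "computer", "Vancouver"), (2, "software", "Vancouver"),
        (3, "computer", "Vancouver"), (3, "printer", "Vancouver"),
        (4, "computer", "Toronto"), (4, "printer", "Toronto"),
    ],
)


def frequent_pairs(conn, branch, min_count):
    """Item pairs bought together at `branch` in at least `min_count` transactions."""
    query = """
        SELECT a.item, b.item, COUNT(*) AS n
        FROM purchases AS a
        JOIN purchases AS b
          ON a.trans_id = b.trans_id AND a.item < b.item
        WHERE a.branch = ? AND b.branch = ?
        GROUP BY a.item, b.item
        HAVING COUNT(*) >= ?
        ORDER BY n DESC, a.item, b.item
    """
    return conn.execute(query, (branch, branch, min_count)).fetchall()


pairs = frequent_pairs(conn, "Vancouver", 2)
print(pairs)
```

Changing the branch or the threshold is a new ad hoc request against the same data, which is the flexibility the paragraph describes. A purpose-built mining language would let the user state the pattern type and constraints more directly than this hand-written join.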
Presentation and visualization of data mining results: How can a data mining system
present data mining results, vividly and flexibly, so that the discovered knowledge
can be easily understood and directly usable by humans? This is especially crucial
if the data mining process is interactive. It requires the system to adopt expressive
knowledge representations, user-friendly interfaces, and visualization techniques.
1.7.3