Data Mining
CONTENTS
1. Introduction
• Data Selection
• Data Transformation
• Data Mining
• Result Interpretation
8. Conclusion
INTRODUCTION:
A data warehouse is designed especially for
decision support queries; therefore only data that is needed
for decision support is extracted from the operational data
and stored in the warehouse.
Data Mining is the "automatic" extraction of patterns of
information from historical data, enabling companies to focus
on the most important aspects of their business -- telling
them what they did not know and had not even thought of
asking.
Many organizations now view information as one of their
most valuable assets and data mining allows a company to
make full use of these information assets.
DATA MINING:
Data selection:
A data warehouse contains a variety of data, not all of
which is needed to achieve each data mining goal. The first
step in the data mining process is to select the target data.
For example, marketing databases contain data describing
customer purchases, demographics, and lifestyle
preferences. To identify which items and quantities to
purchase for a particular store, as well as how to organize
the items on the store shelves, a marketing executive might
need only to combine customer purchase data with
demographic data. The selected data types may be
organized along multiple tables; during data selection, the
user might need to perform table joins. Furthermore, even
after selecting the desired database tables, mining the
contents of an entire table is not always necessary for
identifying useful information. Under certain conditions and
for certain types of data mining operations, it is often less
expensive to sample the appropriate table, which might have
been created by joining other tables, and then mine only the
sample.
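As a sketch of this sampling step, the toy Python below draws a random sample from a hypothetical joined table rather than mining it whole; the table, its columns, and the sample size are all invented for illustration:

```python
import random

# Hypothetical joined table: each row combines purchase and demographic data.
joined_rows = [
    {"customer_id": i, "item": "widget", "age_group": "25-34"}
    for i in range(10_000)
]

# Mine a random sample instead of the full table.
random.seed(0)              # fixed seed, for reproducibility of the sketch
sample = random.sample(joined_rows, k=500)
print(len(sample))          # 500 rows instead of 10,000
```

Mining then proceeds on the 500-row sample exactly as it would on the full table, at a fraction of the cost.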
Data transformation:
After selecting the desired data base tables and
identifying the data to be mined, the user typically needs to
perform certain transformations on the data. Three
considerations dictate which transformation to use: the task
(e.g., mailing list creation), the data mining operations
(such as predictive modeling), and the data mining
technique (such as neural networks) involved.
Transformation methods include organizing data in desired
ways (organizing individual consumer data by household),
and converting one type of data to another. Another
transformation type, the definition of new attributes (derived
attributes), involves applying mathematical or logical
operators to the values of one or more database attributes,
e.g., by defining the ratio of two attributes.
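A derived attribute of the kind just described can be sketched in a few lines of Python; the customer records and attribute names here are hypothetical:

```python
# Hypothetical customer records; 'income' and 'debt' are assumed attributes.
customers = [
    {"income": 52_000, "debt": 13_000},
    {"income": 40_000, "debt": 30_000},
]

# Derived attribute: a debt-to-income ratio, built by applying an
# arithmetic operator to two existing attributes.
for c in customers:
    c["debt_to_income"] = c["debt"] / c["income"]

print(customers[0]["debt_to_income"])  # 0.25
```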
Data mining:
The user subsequently mines the transformed data
using one or more techniques to extract the desired type of
information. For example, to develop an accurate, symbolic
classification model that predicts whether magazine
subscribers will renew their subscriptions, a circulation
manager might need to first use clustering to segment the
subscriber database, and then apply rule induction to
automatically create a classification model for each desired
cluster.
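The cluster-then-classify idea can be sketched with toy data in Python. The tenure threshold and the majority-class "rule" stand in for real clustering and rule-induction algorithms; all values are invented:

```python
# Toy subscriber records: (months_subscribed, renewed?). Values illustrative.
subscribers = [(3, False), (5, False), (6, True), (24, True), (30, True), (36, True)]

# Step 1 (clustering): segment subscribers by tenure into two clusters.
clusters = {"short": [], "long": []}
for months, renewed in subscribers:
    key = "short" if months < 12 else "long"
    clusters[key].append(renewed)

# Step 2 (rule induction): a majority-class rule per cluster.
model = {name: max(set(labels), key=labels.count) for name, labels in clusters.items()}
print(model)  # {'short': False, 'long': True}
```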
Result interpretation:
The user must finally analyze the mined information
according to his or her decision support goals. Such analysis
identifies the best of the information. For example, if a
classification model has been developed, then during result
interpretation the data mining application will test the
model's robustness, using established error-estimation
methods such as cross-validation. During this step, the user must also
determine how best to present the selected mining operation
results to the decision maker, who will apply them in taking
specific actions.
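A minimal sketch of the cross-validation idea, assuming a trivial majority-vote model and toy labels (both invented for illustration):

```python
# Toy labelled data and a trivial majority-vote "model".
labels = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]

def majority(train):
    # Predict the most common label in the training portion.
    return max(set(train), key=train.count)

# 5-fold cross-validation: hold out each fold, fit on the rest, count errors.
k = 5
fold = len(labels) // k
errors = 0
for i in range(k):
    test = labels[i * fold:(i + 1) * fold]
    train = labels[:i * fold] + labels[(i + 1) * fold:]
    pred = majority(train)
    errors += sum(1 for y in test if y != pred)

print(errors / len(labels))  # estimated error rate
```

A real application would substitute an actual classification algorithm for the majority vote, but the hold-out bookkeeping is the same.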
Three aspects of this process deserve emphasis:
1) Mining is only one step in the overall process. The
quality of the mined information is a function of both the
effectiveness of the data mining technique used and the
quality, and often size of the data being mined. If users
select the wrong data, choose inappropriate attributes, or
transform the selected data inappropriately, the results will
likely suffer.
2) The process is not linear but involves a variety of
feedback loops. After selecting a particular data mining
technique, a user might determine that the selected data
must be preprocessed in particular ways or that the applied
technique did not produce results of the expected quality.
The user then must repeat earlier steps, which might mean
restarting the entire process from the beginning.
3) Visualization plays an important role in the various
steps. In particular, during the selection and transformation
steps, a user could use statistical visualizations such as
scatter plots or histograms to display the results of
exploratory data analysis. Such exploratory analysis often
provides preliminary understanding of the data, which helps
the user select certain data subsets. During the mining step,
the user employs domain-specific visualizations.
DATA MINING OPERATIONS:
There are two main classes of operations associated with
data mining: verification-driven and discovery-driven.
Query and reporting:
A query and reporting operation validates a
hypothesis postulated by the user, such as "sales of four
wheel drive vehicles increase during the winter". Validating a
hypothesis through a query and reporting operation entails
creating a query, or a set of queries, that best expresses the
stated hypothesis, posing the query to the database, and
analyzing the returned data to establish whether it supports
or refutes the hypothesis. Each data interpretation or
analysis step might lead to additional queries, either new
ones or refinements of the initial one. Reports subsequently
compiled for distribution throughout an organization contain
selected analysis results, presented in graphical, tabular, and
textual form. Because these reports include the queries,
analysis can be automatically repeated at predefined times,
such as once a month.
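The hypothesis above can be expressed as a simple query against toy sales records (all figures invented); comparing the two averages stands in for the analysis step:

```python
# Toy sales records: (month, vehicle_type, units_sold). Values illustrative.
sales = [(1, "4wd", 90), (2, "4wd", 85), (7, "4wd", 40), (8, "4wd", 35),
         (12, "4wd", 95), (6, "sedan", 70)]

winter = {12, 1, 2}
# The "query": average 4WD units sold in winter vs. in other months.
w = [u for m, t, u in sales if t == "4wd" and m in winter]
o = [u for m, t, u in sales if t == "4wd" and m not in winter]
print(sum(w) / len(w) > sum(o) / len(o))  # True supports the hypothesis
```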
Multidimensional analysis:
Multidimensional spreadsheets and databases are
becoming popular for data analysis that requires summary
views of the data along multiple dimensions.
Multidimensional databases, often implemented as
multidimensional arrays, organize data along predefined
dimensions. These databases also allow hierarchical
organization of the data along each dimension, with
summaries on the higher levels of the hierarchy and the
actual data at the lower levels. Data mining technologies
perform automatic analysis that can help enhance the value
of the data exploration supported by multidimensional tools.
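A roll-up along one dimension, of the kind a multidimensional tool provides, can be sketched in Python; the fact records and dimension names are illustrative:

```python
from collections import defaultdict

# Toy fact data along two dimensions: (region, product, revenue).
facts = [("East", "soap", 100), ("East", "soap", 50),
         ("East", "tea", 25), ("West", "soap", 70)]

# Summarize at a higher level of the 'region' dimension hierarchy.
by_region = defaultdict(int)
for region, product, revenue in facts:
    by_region[region] += revenue

print(dict(by_region))  # {'East': 175, 'West': 70}
```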
Statistical analysis:
Simple statistical analysis operations usually execute
during both query and reporting, as well as during
multidimensional analysis. Several statistical analysis tools
(SAS, SPSS) incorporate components that can be used for
discovery-driven modeling.
Discovery-driven data mining operations include:
• Predictive modeling
• Clustering
• Frequency pattern extraction
• Deviation detection
Predictive modeling:
This is based on techniques used for classification and
regression modeling. One field in the tabular data set is pre-
identified as the response or class variable. The algorithm
produces a model for that variable as a function of other
fields in the data set, pre-identified as features or
explanatory variables. If the response variable is discrete-
valued, then classification modeling is employed. If the
response variable is continuous valued, then regression
modeling is employed. The principal issue addressed by this
algorithm is to produce a predictively accurate function
approximation for the response variable by using the data
set as an example relation between instances of explanatory
variables and the response variable, in the presence of noise.
Once produced, and given values for the explanatory
variables, the model can be used to predict the value of the
response variable.
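For a continuous-valued response, regression modeling applies. A minimal least-squares sketch with invented data:

```python
# Toy data for regression modeling: one explanatory variable, a
# continuous response (values invented, roughly y = 2x).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.1, 8.0]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
# Least-squares slope and intercept for the fitted line.
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx

# Prediction for a new instance of the explanatory variable:
print(slope * 5.0 + intercept)
```

A discrete-valued response would instead call for a classification algorithm, but the fit-then-predict pattern is the same.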
Clustering:
Clustering constitutes a major class of data mining
algorithm. First, an appropriate subset is selected for data
mining. Then the data is cleaned to remove noise. Using the
same tabular data model described earlier, the algorithm
attempts to automatically partition the data space into a set
of regions or clusters, to which examples in the table are
assigned either deterministically or probabilistically. The
goal of the search process used by this algorithm is to
identify all sets of similar examples in the data, in some
optimal fashion.
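A minimal one-dimensional k-means sketch illustrates the partitioning idea; the points and starting centers are invented:

```python
# Minimal 1-D k-means sketch (k = 2); data and starting centers illustrative.
points = [1.0, 1.5, 2.0, 10.0, 11.0, 12.0]
centers = [points[0], points[-1]]

for _ in range(10):  # a few refinement passes suffice here
    groups = [[], []]
    for p in points:
        # Assign each point to its nearest center (deterministic assignment).
        groups[0 if abs(p - centers[0]) <= abs(p - centers[1]) else 1].append(p)
    # Move each center to the mean of its cluster.
    centers = [sum(g) / len(g) for g in groups]

print(centers)  # [1.5, 11.0]
```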
Frequency pattern extraction:
This operation extracts patterns that
exist in the data, with some predefined level of regularity.
Typically, the basic pattern to be extracted is an association:
a tuple of two sets, with a unidirectional causal implication
between the two sets, denoted A->B. Attached to this
tuple are two statistical measures: confidence and support.
Confidence measures the fraction of times B exists in the
data set when A is present; while support measures the
number of times A exists as a fraction of the total data. Thus,
the association with a very high support and confidence is a
pattern that occurs so often in the data that it should be
obvious to the end user. Patterns with extremely low support
and confidence should be regarded as insignificant. Only
patterns with combinations of intermediate values for
confidence and support provide the user with interesting and
previously unknown information. Many variations of this basic
association pattern have been formulated, along with
algorithms to extract them. Temporal relations that are
extracted may hold significance, either as frequent
occurrences or as frequent matchings among groups of
temporal patterns.
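Support and confidence as defined above can be computed directly from a set of toy transactions (item names invented):

```python
# Toy market-basket transactions; item names are illustrative.
transactions = [{"bread", "milk"}, {"bread", "milk", "eggs"},
                {"bread"}, {"milk"}, {"bread", "milk"}]

a, b = {"bread"}, {"milk"}          # candidate association: A -> B
with_a = [t for t in transactions if a <= t]
with_both = [t for t in with_a if b <= t]

support = len(with_a) / len(transactions)   # how often A occurs in the data
confidence = len(with_both) / len(with_a)   # how often B occurs when A does
print(support, confidence)  # 0.8 0.75
```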
Deviation detection:
This operation attempts to identify points that cannot
be fitted into a segment and then explain whether each such
point is noise or should be examined in more detail. This
operation usually operates in conjunction with database
segmentation and, because "outliers" express deviation from
expected norms, often leads to true discovery.
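A simple deviation-detection sketch flags points that lie far from the mean of a toy data set; the two-standard-deviation threshold is an illustrative choice:

```python
# Flag points far from the mean as candidate outliers (toy data).
values = [10.0, 11.0, 9.0, 10.5, 9.5, 30.0]

n = len(values)
mean = sum(values) / n
std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5

# Points more than two standard deviations from the mean.
outliers = [v for v in values if abs(v - mean) > 2 * std]
print(outliers)  # [30.0]
```

Each flagged point would then be examined to decide whether it is noise or a genuine discovery.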
CHARACTERISTICS OF DATA MINING:
In 70% of applications, users perform data mining
using verification-driven operations. Data analysts and
business analysts alike thoroughly understand query and
reporting, multidimensional analysis, and statistical analysis.
Neural and symbolic induction methods have only recently
been developed.
Two factors have inhibited the broad deployment of
applications that incorporate discovery-driven data mining
techniques: the significant effort necessary to develop each
data mining application and the inappropriate state of the
data that an application must mine.
Application Development:
Most deployed data mining applications are not
developed by business analysts but through the collaboration
of data mining tool vendors, data analysts, and end users.
Because the tool vendors and data analysts usually first must
develop an understanding of the end user's problem, such
collaborations are time consuming. Furthermore, the current
generation of data mining tools is aimed at the data analyst,
not the business analyst.
Data:
Data mining systems rely on databases to supply raw data
for input, and this raises problems in that databases tend to
be dynamic, incomplete, noisy, and large. Other
problems include the inadequacy and irrelevance of the
information stored. In all, the problems can be categorized
as:
• Limited information
• Noise or missing data
• Uncertainty, and
• Size, updates, and irrelevant fields
Limited Information:
A database is often designed for purposes different from
data mining, and sometimes the properties or attributes that
would simplify the learning task are neither present nor can
be requested from the real world. Inconclusive data causes
problems because if some attributes essential to knowledge
about the application domain are not present in the data, it
may be impossible to discover significant knowledge about a
given domain. For example, one cannot diagnose malaria
from a patient database if that database does not contain
the patient's red blood cell count.
Noise and missing values:
Databases are usually contaminated with errors, so it
cannot be assumed that the data they contain is entirely
correct. Attributes that rely on subjective or measurement
judgment can give rise to errors such that some examples
may even be misclassified. Errors either in values of
attributes or class information are known as noise. Obviously,
it is desirable to eliminate noise from the classification
information as it affects the overall accuracy of the
information.
Missing data can be treated by discovery systems in a
number of ways, such as: simply disregarding missing values,
omitting corresponding records, inferring missing values
from known values, and treating the missing data as a
special value to be included additionally in the attribute
domain.
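The treatments listed above can be sketched on a toy attribute with missing values (all data invented):

```python
# Toy records with a missing attribute value (None); the three strategies
# below mirror the treatments listed above.
ages = [34, None, 29, 41, None, 30]

known = [a for a in ages if a is not None]
mean_age = sum(known) / len(known)

dropped  = known                                              # omit records with missing values
inferred = [a if a is not None else mean_age for a in ages]   # infer from known values
flagged  = [a if a is not None else "missing" for a in ages]  # special value in the domain

print(dropped, inferred[1], flagged[4])
```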
Uncertainty:
This refers to the severity of the error and the degree of
noise in the data. Data precision is an important
consideration in a discovery system.
Size, updates, and irrelevant fields:
Databases tend to be large and dynamic in that their
contents are ever-changing as information is added, modified,
or removed. The problem with this, from the data mining
perspective, is how to ensure that rules are up to date and
consistent with the most current information.
APPLICATIONS OF DATA MINING:
Predictive modeling techniques are best used when a
large body of historical data is available. This data is used to
model a variable of interest, so that this variable may be
forecast in future scenarios, and effective actions taken
based on that forecast. Some examples are as follows:
Risk Analysis:
Given a set of current customers and an assessment of
their risk worthiness, develop descriptions for various
classes. Use these descriptions to classify a new customer
into one of the risk categories.
Targeted Marketing:
Given a data base of potential customers and how they
have responded to a solicitation, develop a model of
customers most likely to respond positively. Use the model
for more focused new customer solicitation.
Customer retention:
Given a database of past customers and their behavior
prior to attrition, develop a model of customers most likely
to leave. Use the model for determining the best course of
action for these customers.
Portfolio management:
Given a particular financial asset, predict the return on
investment to determine whether to include the asset in the
portfolio.
Brand loyalty:
Given a particular customer and a product he or she
uses, predict whether the customer will switch brands.
Using frequent patterns extracted from data, one can
build link analysis and item set analysis applications. These
are used for determining business values of promotional
effectiveness, analyzing subscriber services, forecasting
demand, etc. One of the more powerful applications of this
technique is market basket analysis, in which databases of
sales transactions are examined to extract patterns that
identify what items sell together, what items sell better when
relocated to new areas, and what product groupings improve
department sales. Clustering approaches are one of the
more pervasive applications of data mining. As databases
grow, it is often necessary to partition them into collections of
related records to obtain better summaries of the
subpopulations present in the data.
Another area of new application is the study and
analysis of Internet traffic. Just as sales and bank data could
be mined to help the retail store or bank improve its
products and marketing, Internet traffic on a web site can be
analyzed to better understand where the real demand is,
what pages are being looked at collectively, and so on.
Service providers can use this information to better organize
their web pages.
CONCLUSION:
Data mining is becoming an integral part of the
operations in organizations of varying sizes. Many
organizations that only recently have begun analyzing their
data have started to successfully use applications employing
verification-driven data mining techniques. Applications
using discovery-driven techniques are also finding increased
use. While many of the deployed applications primarily
employ predictive modeling techniques, application
developers and end users alike are beginning to recognize
the need to use additional techniques from the discovery-
driven data mining repertory. Applications with broad market
appeal such as market basket analysis and customer
segmentation have successfully demonstrated the
advantages of using such techniques.