
Data Mine

This document provides an overview of data mining. It discusses that data mining is the process of extracting useful patterns from large data sets. The key steps in data mining include data selection, transformation, mining, and interpretation. Data mining can be used to discover hidden patterns and relationships that can help businesses make better decisions. It allows organizations to focus on the most important information. The document also discusses different types of data mining operations and applications of data mining across various industries.

Uploaded by

rajeshbubble
Copyright
© Attribution Non-Commercial (BY-NC)

DATA MINING

CONTENTS

1. Introduction

2. Data Mining Basics

3. The Data Mining Process

• Data Selection
• Data Transformation
• Data Mining
• Result Interpretation

4. Observations Emerged from Data Mining Process

5. Data Mining Operations

• Verification-Driven Data Mining
• Discovery-Driven Data Mining

6. Characteristics of Data Mining

7. Applications of Data Mining

8. Conclusion

INTRODUCTION:

Data mining is emerging as a key technology for
enterprises that wish to improve the quality of their decision
making and competitive advantage by exploiting operational
and other available data. Data mining, the extraction of
hidden predictive information from large databases, is a
powerful new technology with great potential to help
companies focus on the most important information in their
data warehouse.
Modern organizations must respond quickly to changes in the
market. To do so, they need rapid access to all kinds of
information before making decisions. To make the right
choices for the organization, it is essential to be able to
research the past and identify relevant trends. Trend
analysis requires access to all the supporting information,
which is mainly stored in very large databases. The easiest
way to gain access to this data and facilitate effective
decision making is to set up a data warehouse.
Data mining techniques can be applied to the part of the
existing data warehouse that is of interest to the end user.
Mining operational data directly is almost impossible,
because it mixes different types of attributes and different
data types and holds no historical data. A data warehouse
avoids this problem, because the information has been
transferred from the operational database to the warehouse.
Most organizations run very large databases for their
normal daily transactions. These are known as operational
databases. They are not designed to store historical data or
to answer decision support queries. A data warehouse is
designed especially for decision support queries; therefore
only data that is needed for decision support is extracted
from the operational data and stored in the warehouse.
Data Mining is the "automatic" extraction of patterns of
information from historical data, enabling companies to focus
on the most important aspects of their business -- telling
them what they did not know and had not even thought of
asking.

Industry surveys clearly indicate that over 80% of Fortune
500 companies view data mining as a critical factor for
business success by the year 2000. Most such companies
now collect and refine massive quantities of data in data
warehouses.

These companies realize that to succeed in a fast-paced
world, business users need to be able to get information on
demand. They also need to be pleasantly surprised by
unexpected but useful information. There is never enough
time to think of all the important questions, so the computer
should do this itself. Data mining can provide the winning
edge in business by exploring the database itself and
bringing back invaluable nuggets of information.

Many organizations now view information as one of their
most valuable assets and data mining allows a company to
make full use of these information assets.

DATA MINING:

Data mining is the process of extracting valid, previously
unknown, comprehensible, and actionable information from
large databases and using it to make crucial business
decisions.
The crux of the appeal for this new technology lies in the
data analysis algorithms, since they provide automated
mechanisms for sifting through data and extracting useful
information.

DATA MINING BASICS:

To be effective, a data mining application must do three
things. First, it must have access to organization-wide
data rather than department-specific data. Frequently, the
organization's data is supplemented with open-source or
purchased data. The resulting database is called the data
warehouse. During data integration, the application often
cleans the data, for example by removing duplicates and
deriving missing values (when possible). Second, the data
mining application must mine the information in the
warehouse. Finally, it must organize and present the mined
information in a way that enables decision making.

THE DATA MINING PROCESS:

Once the data warehouse has been developed, the data
mining process falls into four basic steps: data selection,
data transformation, data mining and result interpretation.

Data selection:
A data warehouse contains a variety of data, not all of
which is needed to achieve each data mining goal. The first
step in the data mining process is to select the target data.
For example, marketing databases contain data describing
customer purchases, demographics, and lifestyle
preferences. To identify which items and quantities to
purchase for a particular store, as well as how to organize
the items on the store shelves, a marketing executive might
need only to combine customer purchase data with
demographic data. The selected data may be spread across
multiple tables, so during data selection the user might
need to perform table joins. Furthermore, even after the
desired database tables have been selected, mining the
entire contents of a table is not always necessary for
identifying useful information. Under certain conditions and
for certain types of data mining operations, it is less
expensive to sample the appropriate table, which might have
been created by joining other tables, and then mine only the
sample.
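The join-then-sample idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table layouts, the field names, and the helpers select_target_data and sample_rows are hypothetical, and a real warehouse would normally perform the join in the database itself.

```python
import random

# Hypothetical warehouse tables: purchases and customer demographics.
purchases = [
    {"customer_id": 1, "item": "milk", "quantity": 2},
    {"customer_id": 2, "item": "bread", "quantity": 1},
    {"customer_id": 1, "item": "eggs", "quantity": 12},
    {"customer_id": 3, "item": "milk", "quantity": 1},
]
demographics = {
    1: {"age_group": "25-34", "region": "north"},
    2: {"age_group": "35-44", "region": "south"},
    3: {"age_group": "25-34", "region": "south"},
}


def select_target_data(purchases, demographics):
    """Join each purchase with the buyer's demographic record (inner join)."""
    joined = []
    for row in purchases:
        demo = demographics.get(row["customer_id"])
        if demo is not None:  # drop purchases without demographic data
            joined.append({**row, **demo})
    return joined


def sample_rows(rows, fraction, seed=0):
    """Draw a simple random sample so that only part of the table is mined."""
    size = max(1, round(len(rows) * fraction))
    return random.Random(seed).sample(rows, size)


target = select_target_data(purchases, demographics)
sample = sample_rows(target, fraction=0.5)
```

The fixed seed only makes the sketch repeatable; whether a simple random sample is appropriate depends on the mining operation that follows.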
Data transformation:
After selecting the desired database tables and
identifying the data to be mined, the user typically needs to
perform certain transformations on the data. Three
considerations dictate which transformation to use: the task
(mailing list creation, for example), the data mining
operation (such as predictive modeling), and the data mining
technique (such as neural networks) involved.
Transformation methods include organizing data in desired
ways (for example, organizing individual consumer data by
household) and converting one type of data to another.
Another transformation type, the definition of new
attributes (derived attributes), involves applying
mathematical or logical operators to the values of one or
more database attributes, for example by defining the ratio
of two attributes.
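As a rough sketch of two of the transformations mentioned above, the following Python regroups individual consumer records by household and then defines a derived attribute as the ratio of two existing attributes. The records, the field names, and the helpers by_household and add_spend_ratio are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical individual-level consumer records.
consumers = [
    {"name": "A", "household": "H1", "income": 40000, "spend": 10000},
    {"name": "B", "household": "H1", "income": 20000, "spend": 5000},
    {"name": "C", "household": "H2", "income": 50000, "spend": 25000},
]


def by_household(records):
    """Reorganize individual records into household-level totals."""
    totals = defaultdict(lambda: {"income": 0, "spend": 0})
    for rec in records:
        totals[rec["household"]]["income"] += rec["income"]
        totals[rec["household"]]["spend"] += rec["spend"]
    return dict(totals)


def add_spend_ratio(households):
    """Define a derived attribute: the ratio of spend to income."""
    return {
        hid: {**vals, "spend_ratio": vals["spend"] / vals["income"]}
        for hid, vals in households.items()
    }


households = add_spend_ratio(by_household(consumers))
```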
Data mining:
The user subsequently mines the transformed data
using one or more techniques to extract the desired type of
information. For example, to develop an accurate, symbolic
classification model that predicts whether magazine
subscribers will renew their subscriptions, a circulation
manager might need to first use clustering to segment the
subscriber database, and then apply rule induction to
automatically create a classification model for each desired
cluster.
Result interpretation:
Finally, the user must analyze the mined information
according to the decision support task and goals. Such
analysis identifies the best of the information. For example,
if a classification model has been developed, during result
interpretation the data mining application will test the
model's robustness, using established error-estimation
methods such as cross validation. During this step, the user
must also determine how best to present the selected mining
operation results to the decision maker, who will apply them
in taking specific actions.
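Cross validation, mentioned above as an error-estimation method, can be illustrated with a minimal k-fold sketch. The "model" here is a deliberately trivial majority-class predictor, chosen only to keep the example short; the labels and the helper names are hypothetical, and in practice the records would be shuffled before being split into folds.

```python
from collections import Counter


def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    base, extra = divmod(n, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def train_majority(labels):
    """Toy 'model': always predict the most common training label."""
    return Counter(labels).most_common(1)[0][0]


def cross_validated_error(labels, k):
    """Average held-out error rate over k folds."""
    errors = []
    for fold in k_fold_indices(len(labels), k):
        held_out = set(fold)
        train = [y for i, y in enumerate(labels) if i not in held_out]
        prediction = train_majority(train)
        wrong = sum(1 for i in fold if labels[i] != prediction)
        errors.append(wrong / len(fold))
    return sum(errors) / k


# Hypothetical renewal outcomes for six magazine subscribers.
labels = ["renew", "renew", "lapse", "renew", "renew", "renew"]
error = cross_validated_error(labels, k=3)
```

Each record is held out exactly once, so the averaged error estimates how the model behaves on data it was not trained on.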

Three observations emerge from this four-step process:
1) Mining is only one step in the overall process. The
quality of the mined information is a function of both the
effectiveness of the data mining technique used and the
quality, and often the size, of the data being mined. If users
select the wrong data, choose inappropriate attributes, or
transform the selected data inappropriately, the results will
likely suffer.
2) The process is not linear but involves a variety of
feedback loops. After selecting a particular data mining
technique, a user might determine that the selected data
must be preprocessed in particular ways or that the applied
technique did not produce results of the expected quality.
The user then must repeat earlier steps, which might mean
restarting the entire process from the beginning.
3) Visualization plays an important role in the various
steps. In particular, during the selection and transformation
steps, a user could use statistical visualizations such as
scatter plots or histograms to display the results of
exploratory data analysis. Such exploratory analysis often
provides preliminary understanding of the data, which helps
the user select certain data subsets. During the mining step,
the user employs domain-specific visualizations.
DATA MINING OPERATIONS:
There are two main classes of operations associated with data
mining.

Verification-driven data mining operations:


Verification-driven data mining extracts information in
the process of validating a hypothesis postulated by a user.
These include query and reporting, multidimensional analysis
and statistical analysis.

Query and reporting:


This operation constitutes the most basic form of
decision support and data mining. Its goal is to validate a
hypothesis postulated by the user, such as "sales of
four-wheel-drive vehicles increase during the winter".
Validating a hypothesis through a query and reporting
operation entails creating a query, or a set of queries, that
best expresses the stated hypothesis, posing the query to the
database, and analyzing the returned data to establish
whether it supports or refutes the hypothesis. Each data
interpretation or
analysis step might lead to additional queries, either new
ones or refinements of the initial one. Reports subsequently
compiled for distribution throughout an organization contain
selected analysis results, presented in graphical, tabular, and
textual form. Because these reports include the queries,
analysis can be automatically repeated at predefined times,
such as once a month.
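The four-wheel-drive hypothesis above can be expressed as a query roughly as follows. This sketch uses Python's built-in sqlite3 module with an in-memory database; the sales table, its columns, the figures, and the choice of December to February as "winter" are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical sales table: month number, drive type, units sold.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (month INTEGER, drive TEXT, units INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        (1, "4WD", 120), (2, "4WD", 110), (12, "4WD", 130),
        (6, "4WD", 70), (7, "4WD", 60), (8, "4WD", 80),
        (1, "2WD", 200), (7, "2WD", 210),
    ],
)

# The hypothesis expressed as a query: average 4WD units sold per season.
query = """
    SELECT CASE WHEN month IN (12, 1, 2) THEN 'winter' ELSE 'other' END AS season,
           AVG(units) AS avg_units
    FROM sales
    WHERE drive = '4WD'
    GROUP BY season
"""
report = dict(conn.execute(query).fetchall())

# Analyze the returned data: does it support or refute the hypothesis?
hypothesis_supported = report["winter"] > report["other"]
```

Because the query is stored as text, the same check can be rerun on a schedule as new sales data arrives.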
Multidimensional analysis:
Multidimensional spreadsheets and databases are
becoming popular for data analysis that requires summary
views of the data along multiple dimensions.
Multidimensional databases, often implemented as
multidimensional arrays, organize data along predefined
dimensions. These databases also allow hierarchical
organization of the data along each dimension, with
summaries on the higher levels of the hierarchy and the
actual data at the lower levels. Data mining technologies
perform automatic analysis that can help enhance the value
of the data exploration supported by multidimensional tools.
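A summary view along chosen dimensions, as described above, amounts to summing a measure over the dimensions that are left out. The following Python sketch shows that roll-up on a tiny set of records; the dimension names, the data, and the roll_up helper are hypothetical, and real multidimensional databases precompute and index such summaries.

```python
from collections import defaultdict

# Hypothetical fact records: (region, city, quarter, sales amount).
facts = [
    ("north", "Oslo", "Q1", 100),
    ("north", "Oslo", "Q2", 150),
    ("north", "Bergen", "Q1", 50),
    ("south", "Rome", "Q1", 200),
]
DIMENSIONS = ("region", "city", "quarter")


def roll_up(facts, keep):
    """Sum the measure over every dimension not listed in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[p] for p in positions)
        totals[key] += row[-1]
    return dict(totals)


# Higher level of the hierarchy: totals per region.
by_region = roll_up(facts, keep=("region",))
# A two-dimensional summary: totals per region and quarter.
by_region_quarter = roll_up(facts, keep=("region", "quarter"))
```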
Statistical analysis:
Simple statistical analysis operations usually execute
during both query and reporting, as well as during
multidimensional analysis. Several statistical analysis tools
(SAS, SPSS) incorporate components that can be used for
discovery-driven modeling.

Discovery-driven data mining operations:


Discovery-driven data mining uses four types of algorithms,
depending on the kind of information extracted. They are:
• Predictive modeling
• Clustering
• Frequency pattern extraction
• Deviation detection
Predictive modeling:
This is based on techniques used for classification and
regression modeling. One field in the tabular data set is
pre-identified as the response or class variable. The
algorithm produces a model for that variable as a function of
other fields in the data set, pre-identified as features or
explanatory variables. If the response variable is
discrete-valued, then classification modeling is employed. If
the response variable is continuous-valued, then regression
modeling is employed. The principal issue addressed by this
algorithm is to produce a predictively accurate function
approximation for the response variable, using the data set
as an example relation between instances of the explanatory
variables and the response variable, in the presence of
noise. Once the model has been produced, it can be used to
predict the value of the response variable from given values
of the explanatory variables.
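For a continuous-valued response, the simplest instance of the regression case above is a straight line fitted by ordinary least squares. The sketch below assumes a single explanatory variable; the data and the names fit_line and predict are hypothetical, and it stands in for, rather than reproduces, the techniques a data mining tool would use.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(model, x):
    """Use the fitted model to predict the response for a new x."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical noisy data: explanatory variable (ad spend), response (sales).
ad_spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.1, 4.9, 7.2, 8.8]
model = fit_line(ad_spend, sales)
forecast = predict(model, 5.0)
```

The fitted line does not pass through every point; it approximates the relation between the variables despite the noise in the examples.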
Clustering:
Clustering constitutes a major class of data mining
algorithms. First, an appropriate subset is selected for data
mining. Then the data is cleaned to remove noise. Using the
same tabular data model described earlier, the algorithm
attempts to automatically partition the data space into a set
of regions or clusters, to which examples in the table are
assigned, either deterministically or probabilistically. The
goal of the search process used by this algorithm is to
identify all sets of similar examples in the data, in some
optimal fashion.
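One common way to partition data into clusters with deterministic assignment is k-means, shown here in a minimal one-dimensional form. The text does not name a specific algorithm, so k-means is an assumption made for this sketch, as are the data, the starting centers, and the k_means_1d helper.

```python
def k_means_1d(values, centers, iterations=10):
    """Plain k-means on one-dimensional data with given starting centers."""
    centers = list(centers)
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            # Deterministic assignment: each value joins its nearest center.
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster (keep it if empty).
        centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical attribute: yearly purchases per customer.
purchases = [1, 2, 3, 20, 21, 22]
centers, clusters = k_means_1d(purchases, centers=[0, 10])
```

The result depends on the starting centers and on the number of clusters requested, both of which the user must choose.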

Frequency pattern extraction:


In this algorithm the goal is to extract from the tabular
data model all combinations of variable instantiations that
exist in the data with some predefined level of regularity.
Typically, the basic pattern to be extracted is an
association: a tuple of two sets, with a unidirectional
implication between the two sets, denoted by A->B. Attached
to this tuple are two statistical measures: confidence and
support. Confidence measures the fraction of times B exists
in the data set when A is present, while support measures
the fraction of the total data in which A and B exist
together. Thus, an association with very high support and
confidence is a pattern that occurs so often in the data that
it should be obvious to the end user. Patterns with extremely
low support and confidence should be regarded as having no
significance. Only patterns with combinations of
intermediate values for confidence and support provide the
user with interesting and previously unknown information.
Many variations of this basic association pattern have been
formulated, along with algorithms to extract them. Temporal
relations that are extracted may also hold significance,
either in terms of frequent occurrences, frequent matching
among groups, or temporal patterns.
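The two measures attached to an association A->B can be computed directly from transaction data. The sketch below assumes the common definition in which support counts the records containing both A and B, and confidence is that count relative to the records containing A; the baskets and the function names are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing A that also contain B."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Hypothetical market-basket data.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
]
# Measures for the association {bread} -> {butter}.
rule_support = support(baskets, {"bread", "butter"})
rule_confidence = confidence(baskets, {"bread"}, {"butter"})
```

Here bread and butter appear together in half of the baskets, and two of the three baskets that contain bread also contain butter.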
Deviation detection:
This operation attempts to identify points that cannot
be fitted into a segment and then explain whether each such
point is noise or should be examined in more detail. This
operation usually works in conjunction with database
segmentation and, because "outliers" express deviation from
expected norms, often leads to true discovery.
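A very simple stand-in for deviation detection is to flag values that lie far from the mean of their segment, measured in standard deviations. The threshold of 2.0, the data, and the find_deviations helper are assumptions made for this sketch; deciding whether a flagged point is noise or a real discovery still requires the closer examination described above.

```python
from statistics import mean, pstdev


def find_deviations(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:  # all values identical, so nothing deviates
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical segment of order sizes containing one unusual value.
order_sizes = [10, 11, 9, 10, 12, 10, 11, 50]
outliers = find_deviations(order_sizes)
```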
CHARACTERISTICS OF DATA MINING:
In 70% of the applications, users perform data mining
using verification-driven operations. Data analysts and
business analysts alike thoroughly understand query and
reporting, multidimensional analysis, and statistical
analysis. By contrast, the neural and symbolic induction
methods used for discovery-driven mining have been developed
only recently.
Two factors have inhibited the broad deployment of
applications that incorporate discovery-driven data mining
techniques: the significant effort necessary to develop each
data mining application and the inappropriate state of the
data that an application must mine.
Application Development:
Most deployed data mining applications are not
developed by business analysts but through the collaboration
of data mining tool vendors, data analysts, and end users.
Because the tool vendors and data analysts usually must first
develop an understanding of the end user's problem, such
collaborations are time-consuming. Furthermore, the current
generation of data mining tools is aimed at the data analyst,
not the business analyst.
Data:
Data mining systems rely on databases to supply raw data
for input, and this raises problems because databases tend
to be dynamic, incomplete, noisy, and large. Other
problems include the inadequacy and irrelevance of the
information stored. In all, the problems can be categorized
as:
• Limited information
• Noise or missing data
• Uncertainty, and
• Size, updates, and irrelevant fields
Limited Information:
A database is often designed for purposes different from
data mining, and sometimes the properties or attributes that
would simplify the learning task are neither present nor can
be requested from the real world. Inconclusive data causes
problems because if some attributes essential to knowledge
about the application domain are not present in the data, it
may be impossible to discover significant knowledge about a
given domain. For example, one cannot diagnose malaria
from a patient database if that database does not contain
the patient's red blood cell count.
Noise and missing values:
Databases are usually contaminated with errors, so it
cannot be assumed that the data they contain is entirely
correct. Attributes that rely on subjective judgment or on
measurement can give rise to errors, so that some examples
may even be misclassified. Errors in either the values of
attributes or the class information are known as noise.
It is desirable to eliminate noise from the classification
information, as it affects the overall accuracy of the
mined information.
Missing data can be treated by discovery systems in a
number of ways, such as simply disregarding missing values,
omitting the corresponding records, inferring missing values
from known values, and treating the missing data as a
special value to be included additionally in the attribute
domain.
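Three of the treatments for missing data listed above can be sketched as small Python functions. The records, the use of None to mark a missing value, the choice of the mean as the inferred value, and the helper names are all assumptions made for illustration.

```python
MISSING = None  # hypothetical marker for a missing value

records = [
    {"age": 30, "income": 50},
    {"age": MISSING, "income": 60},
    {"age": 50, "income": 70},
]


def omit_records(rows, field):
    """Strategy: drop every record whose value for `field` is missing."""
    return [r for r in rows if r[field] is not MISSING]


def infer_from_known(rows, field):
    """Strategy: fill gaps with the mean of the known values."""
    known = [r[field] for r in rows if r[field] is not MISSING]
    fill = sum(known) / len(known)
    return [{**r, field: fill if r[field] is MISSING else r[field]} for r in rows]


def as_special_value(rows, field, marker="unknown"):
    """Strategy: treat 'missing' as one more value in the attribute domain."""
    return [{**r, field: marker if r[field] is MISSING else r[field]} for r in rows]
```

Each function returns new records and leaves the originals untouched, so the strategies can be compared on the same data.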
Uncertainty:
This refers to the severity of the error and the degree of
noise in the data. Data precision is an important
consideration in a discovery system.
Sizes, Updates and Irrelevant fields:
Data bases tend to be large and dynamic in that their
contents are ever changing as information is added, modified
or removed. The problem with this, from the data mining
perspective, is how to ensure that rules are up to date and
consistent with the most current information.
APPLICATIONS OF DATA MINING:
Predictive modeling techniques are best used when a
large body of historical data is available. This data is used to
model a variable of interest, so that this variable may be
forecast in future scenarios, and effective actions taken
based on that forecast. Some examples are as follows:
Risk Analysis:
Given a set of current customers and an assessment of
their risk worthiness, develop descriptions for various
classes. Use these descriptions to classify a new customer
into one of the risk categories.
Targeted Marketing:
Given a data base of potential customers and how they
have responded to a solicitation, develop a model of
customers most likely to respond positively. Use the model
for more focused new customer solicitation.
Customer retention:
Given a database of past customers and their behavior
prior to attrition, develop a model of customers most likely
to leave. Use the model for determining the best course of
action for these customers.
Portfolio management:
Given a particular financial asset, predict the return on
investment to determine whether to include the asset in a
portfolio or not.
Brand loyalty:
Given a particular customer and a product he or she
uses, predict whether the customer will switch brands.
Using frequent patterns extracted from data, one can
build link analysis and item set analysis applications. These
are used for determining the business value of promotional
effectiveness, analyzing subscriber services, forecasting
demand, etc. One of the more powerful applications of this
technique is market basket analysis, in which databases of
sales transactions are examined to extract patterns that
identify what items sell together, what items sell better when
relocated to new areas, and what product groupings improve
department sales.
Clustering approaches are among the more pervasive
applications of data mining. As databases grow, it is often
necessary to partition them into collections of related
records to obtain better summaries of the subpopulations
present in the data.
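The "what items sell together" part of the market basket analysis described above can be approximated by counting how often each pair of items appears in the same transaction. The transactions and the pair_counts helper are hypothetical, and a production system would use a dedicated frequent item set algorithm rather than counting every pair.

```python
from collections import Counter
from itertools import combinations


def pair_counts(transactions):
    """Count how often each unordered pair of items is sold together."""
    counts = Counter()
    for basket in transactions:
        # Sort so that each pair is always stored in the same order.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


# Hypothetical sales transactions.
transactions = [
    ["bread", "butter"],
    ["bread", "butter", "milk"],
    ["milk", "bread"],
]
counts = pair_counts(transactions)
```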
Another area of new application is the study and
analysis of Internet traffic. Just as sales and bank data can
be mined to help a retail store or bank improve its
products and marketing, Internet traffic on a web site can be
analyzed to better understand where the real demand is,
what pages are being looked at together, and so on.
Service providers can use this information to better organize
their web pages.
CONCLUSION:
Data mining is becoming an integral part of the
operations in organizations of varying sizes. Many
organizations that only recently have begun analyzing their
data have started to successfully use applications employing
verification-driven data mining techniques. Applications
using discovery-driven techniques are also finding increased
use. While many of the deployed applications primarily
employ predictive modeling techniques, application
developers and end users alike are beginning to recognize
the need to use additional techniques from the discovery-
driven data mining repertory. Applications with broad market
appeal such as market basket analysis and customer
segmentation have successfully demonstrated the
advantages of using such techniques.
