Data Mining
Data mining refers to the analysis of the large quantities of data that are stored in
computers; it is the extraction, or "mining", of knowledge from large amounts of data.
Data mining is a process used by companies to turn raw data into useful information.
By using software to look for patterns in large batches of data, businesses can learn
more about their customers to develop more effective marketing strategies, increase
sales and decrease costs. Data mining depends on effective data collection,
warehousing, and computer processing.
A BIG PICTURE OF DATA MINING
EXAMPLES OF DATA MINING
Data mining is widely used by banking firms in soliciting credit card customers, by
insurance and telecommunication companies in detecting fraud, by telephone companies
and credit card issuers in identifying those potential customers most likely to churn,
and by manufacturing firms in quality control, among many other applications.
EXAMPLES OF DATA MINING
Data mining can be used by businesses in many ways. Three examples are:
1. Customer profiling, identifying those subsets of customers most profitable to the
business;
2. Targeting, determining the characteristics of profitable customers who have been
captured by competitors;
3. Market-basket analysis, determining product purchases by consumer, which can be
used for product positioning and for cross-selling.
WHY DO WE NEED DATA MINING
Fraud detection
Potential clients
Quality control
Product positioning
Cross-selling
WHAT IS NEEDED TO DO DATA MINING
Data mining requires identification of a problem, along with collection of data that
can lead to better understanding, and computer models to provide statistical or other
means of analysis.
Data mining tools need to be versatile, scalable, capable of accurately predicting
responses between actions and results, and capable of automatic implementation.
Versatile refers to the ability of the tool to apply a wide variety of models. Scalable
means that if a tool works on a small data set, it should also work on larger data
sets.
DATA MINING TECHNIQUES
Association
Classification
Clustering
Prediction
ASSOCIATION
In association, the relationship of a particular item in a data transaction to other items in the same
transaction is used to predict patterns.
For example, if a customer purchases a laptop PC (X), then he or she also buys a mouse (Y) in 60% of
the cases. This pattern occurs in 5.6% of laptop PC purchases. An association rule in this situation can
be “X implies Y, where 60% is the confidence factor and 5.6% is the support factor.” When the
confidence factor and support factor are represented by linguistic variables “high” and “low,”
respectively, the association rule can be written in the fuzzy logic form, such as: “where the support
factor is low, X implies Y is high.” In the case of many qualitative variables, fuzzy association is a
necessary and promising technique in data mining.
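The support and confidence factors above can be computed directly from transaction data. A minimal sketch, using a made-up set of transactions (the items and figures are hypothetical, not the 60%/5.6% example):

```python
# Compute support and confidence for the rule "laptop implies mouse".
# The transaction data below is hypothetical, for illustration only.
transactions = [
    {"laptop", "mouse"},
    {"laptop", "mouse", "bag"},
    {"laptop"},
    {"mouse", "keyboard"},
    {"laptop", "mouse"},
    {"keyboard"},
    {"laptop", "bag"},
    {"bag"},
    {"laptop", "mouse"},
    {"mouse"},
]

def support(itemset, transactions):
    """Fraction of all transactions containing every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction
    that also contain the consequent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

rule_support = support({"laptop", "mouse"}, transactions)       # support factor
rule_confidence = confidence({"laptop"}, {"mouse"}, transactions)  # confidence factor
```

With these ten transactions the rule "laptop implies mouse" has support 0.4 and confidence 4/6, the same kind of figures quoted in the example above.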
CLASSIFICATION
These methods are intended for learning functions that map each item of the selected data into one of a predefined
set of classes. Given the set of predefined classes, a number of attributes, and a “learning (or training) set,”
classification methods can automatically predict the class of other, previously unclassified data. Two key research
problems related to classification results are the evaluation of misclassification and prediction power. Mathematical
techniques that are often used to construct classification methods are
binary decision trees,
neural networks,
linear programming, and
statistics.
BINARY DECISION TREE
By using decision trees, a tree induction model with a “Yes–No” format can be built to split data into
different classes according to their attributes.
How well a model fits the data can be measured by either statistical estimation or information entropy.
Decision trees clearly lay out the problem so that all options can be challenged.
They allow us to fully analyze the possible consequences of a decision.
They provide a framework to quantify the values of outcomes and the probabilities of achieving them.
However, the classification obtained from tree induction may not produce an optimal solution where
prediction power is limited.
HOW TO CONSTRUCT A DECISION TREE
A decision tree starts at a ‘root node’ (or ‘the root’); the intermediate decision
points are called ‘internal nodes’ (or just ‘nodes’); the terminal points are called
‘leaf nodes’ (or ‘leaves’).
DECISION TREE
Leaf nodes have arrows pointing to them, but there are no arrows pointing away from them.
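A tree in the “Yes–No” format described above can be represented and traversed with a short sketch. The attributes, thresholds, and class labels below are invented for illustration only:

```python
# A tiny hand-built decision tree in the "Yes-No" format.
# Internal nodes ask a question; leaf nodes hold a class label.
# The attributes and split values are hypothetical.
tree = {
    "question": ("income", 40000),          # is income > 40000?
    "yes": {
        "question": ("years_employed", 2),  # is years_employed > 2?
        "yes": {"leaf": "approve"},
        "no": {"leaf": "review"},
    },
    "no": {"leaf": "decline"},
}

def classify(record, node):
    """Walk from the root node to a leaf, following Yes-No answers."""
    while "leaf" not in node:
        attribute, threshold = node["question"]
        node = node["yes"] if record[attribute] > threshold else node["no"]
    return node["leaf"]

print(classify({"income": 50000, "years_employed": 5}, tree))  # approve
```

A tree induction algorithm would learn the questions and thresholds from a training set; here they are fixed by hand to show how the splits route a record to a leaf.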
NEURAL NETWORKS
[Figure: a feed-forward neural network with an input layer, hidden layers, and an
output layer. Neurons, which perform most of the computations required by the network,
are connected by weighted channels; each neuron applies an activation function to its
weighted inputs. Forward propagation produces the actual output, and back propagation
uses the error between the actual and desired output to adjust the weights.]
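The forward-propagation step can be sketched in a few lines. The layer sizes and weights below are invented for illustration; a real network would learn them through back propagation:

```python
import math

def sigmoid(x):
    """Activation function: squashes a weighted sum into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, layers):
    """Forward propagation: each layer is a list of neurons, and each
    neuron is a (weights, bias) pair acting on the previous layer."""
    activations = inputs
    for layer in layers:
        activations = [
            sigmoid(sum(w * a for w, a in zip(weights, activations)) + bias)
            for weights, bias in layer
        ]
    return activations

# Hypothetical weights: 2 inputs -> 2 hidden neurons -> 1 output neuron.
hidden = [([0.8, 0.2], 0.0), ([0.4, 0.9], 0.0)]
output = [([0.3, 0.5], 0.0)]
result = forward([1.0, 0.0], [hidden, output])
```

Back propagation would then compare `result` against the desired output and push the error backwards to update each weight; that step is omitted here.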
LINEAR PROGRAMMING
In linear programming approaches, the classification problem is viewed as a special form of linear program.
Given a set of classes and a set of attribute variables, one can define a cutoff limit (or boundary) separating the
classes.
Then each class is represented by a group of constraints with respect to a boundary in the linear program.
The objective function in the linear programming model can minimize the overlapping rate across classes and
maximize the distance between classes.
The linear programming approach results in an optimal classification.
However, the computation time required may exceed that of statistical approaches.
STATISTICAL
Various statistical methods, such as linear discriminant regression, quadratic discriminant regression,
and logistic discriminant regression are very popular and are commonly used in real business
classifications.
Even though statistical software has been developed to handle a large amount of data, statistical
approaches have a disadvantage in efficiently separating multiclass problems in which a pair-wise
comparison (i.e., one class versus the rest of the classes) has to be adopted.
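The pair-wise (one class versus the rest) decomposition mentioned above can be sketched directly; the class labels here are invented:

```python
# Decompose a multiclass problem into binary one-vs-rest problems,
# the pair-wise comparison statistical classifiers must adopt.
# The labels are hypothetical.
labels = ["gold", "silver", "bronze", "gold", "silver"]

def one_vs_rest(labels):
    """For each class, relabel the data as 1 (that class) vs 0 (the rest),
    yielding one binary classification problem per class."""
    problems = {}
    for cls in sorted(set(labels)):
        problems[cls] = [1 if y == cls else 0 for y in labels]
    return problems

problems = one_vs_rest(labels)
```

Each binary problem would then be fed to a two-class method (e.g. logistic regression), which is why multiclass data multiplies the work for statistical approaches.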
CLUSTERING
Cluster analysis takes ungrouped data and uses automatic techniques to put this data
into groups.
Clustering is unsupervised, and does not require a learning set.
It shares common methodological ground with classification: most of the mathematical
models mentioned earlier with regard to classification can be applied to cluster
analysis as well.
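One common automatic grouping technique is k-means. A minimal sketch on one-dimensional data (the points and starting centers are invented), showing that no learning set is needed:

```python
# A minimal k-means clustering sketch on 1-D data. Unsupervised:
# the points carry no class labels. The data are hypothetical.
def kmeans(points, centers, iterations=10):
    groups = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

centers, groups = kmeans([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], [0.0, 10.0])
```

The two centers settle near 1.0 and 9.0, splitting the ungrouped data into its two natural clusters.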
PREDICTION
Sequential pattern analysis seeks to find similar patterns in data transactions over a business period.
These patterns can be used by business analysts to identify relationships among data.
The mathematical models behind Sequential Patterns are logic rules, fuzzy logic, and so on.
As an extension of Sequential Patterns, Similar Time Sequences are applied to discover sequences similar to a
known sequence over both past and current business periods.
In the data mining stage, several similar sequences can be studied to identify future trends in transaction
development.
This approach is useful in dealing with databases that have time-series characteristics.
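Discovering sequences similar to a known sequence can be sketched with a simple distance measure; the sales-like figures and store names below are hypothetical:

```python
# Rank candidate time sequences by similarity to a known sequence,
# using Euclidean distance. The figures are hypothetical.
import math

def distance(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

known = [10, 12, 15, 14, 18]
candidates = {
    "store_a": [11, 12, 16, 13, 19],
    "store_b": [30, 28, 25, 27, 24],
}

# Most similar sequence first.
ranked = sorted(candidates, key=lambda name: distance(known, candidates[name]))
```

Real similar-time-sequence methods use more robust measures (e.g. allowing shifts and scaling), but the idea of ranking sequences against a known one is the same.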
DATA MINING PROCESS
In order to systematically conduct data mining analysis, a general process is usually followed.
CRISP-DM – an industry standard process consisting of a sequence of steps that are usually involved in a data
mining study.
SEMMA – a process specific to SAS. Although not every step of either approach is needed in every analysis,
both provide good coverage of the steps needed, starting with data exploration and moving through data
collection, data processing, analysis, inference, and implementation.
CRISP-DM
There is a Cross-Industry Standard Process for Data Mining (CRISP-DM) widely used by industry
members.
This model consists of six phases intended as a cyclical process.
Business Understanding
Business understanding includes determining business objectives, assessing the current situation,
establishing data mining goals, and developing a project plan.
CRISP-DM
Data Understanding
Once business objectives and the project plan are established, data understanding considers data requirements.
This step can include initial data collection, data description, data exploration, and the verification of data quality.
Data exploration such as viewing summary statistics (which includes the visual display of categorical variables)
can occur at the end of this phase.
Models such as cluster analysis can also be applied during this phase, with the intent of identifying patterns in the
data.
CRISP-DM
Data Preparation
Once the data resources available are identified, they need to be selected, cleaned, built into the form
desired, and formatted.
Data cleaning and data transformation in preparation of data modeling needs to occur in this phase.
Data exploration at a greater depth can be applied during this phase, and additional models utilized,
again providing the opportunity to see patterns based on business understanding.
CRISP-DM PROCESS
CRISP-DM
Modeling
Data mining software tools such as visualization (plotting data and establishing relationships) and
cluster analysis (to identify which variables go well together) are useful for initial analysis.
Tools such as generalized rule induction can develop initial association rules.
Once greater data understanding is gained (often through pattern recognition triggered by viewing
model output), more detailed models appropriate to the data type can be applied.
The division of data into training and test sets is also needed for modeling.
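The division of data into training and test sets can be sketched with a simple random split; the 80/20 proportion and the records are invented for illustration:

```python
# Split records into a training set (for fitting models) and a
# test set (for evaluating them). Proportion is hypothetical.
import random

def train_test_split(records, test_fraction=0.2, seed=42):
    """Shuffle a copy of the records, then cut off a test portion."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed for repeatability
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

records = list(range(10))
train, test = train_test_split(records)
```

Shuffling before the cut matters: data that arrive sorted (e.g. by date or class) would otherwise give a test set unrepresentative of the whole.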
Evaluation
Model results should be evaluated in the context of the business objectives established in the first phase (business
understanding).
This will lead to the identification of other needs (often through pattern recognition), frequently reverting to prior
phases of CRISP-DM.
Gaining business understanding is an iterative procedure in data mining, where the results of various visualization,
statistical, and artificial intelligence tools show the user new relationships that provide a deeper understanding of
organizational operations.
CRISP-DM
Deployment
Data mining can be used both to verify previously held hypotheses and for knowledge discovery (the identification
of unexpected and useful relationships).
Through the knowledge discovered in the earlier phases of the CRISP-DM process, sound models can be obtained
that may then be applied to business operations for many purposes, including prediction or identification of key
situations.
These models need to be monitored for changes in operating conditions, because what might be true today may
not be true a year from now.
If significant changes do occur, the model should be redone.
It’s also wise to record the results of data mining projects so documented evidence is available for future studies.