
DATA MINING

TINEYI CHINHENGO M172823


DEFINITION

 Data mining refers to the analysis of large quantities of data that are stored in computers.
 It refers to extracting, or “mining”, knowledge from large amounts of data.
 Data mining is a process used by companies to turn raw data into useful information. By using software to look for patterns in large batches of data, businesses can learn more about their customers in order to develop more effective marketing strategies, increase sales and decrease costs. Data mining depends on effective data collection, warehousing, and computer processing.
A BIG PICTURE OF DATA MINING
EXAMPLES OF DATA MINING

 Grocery stores have large amounts of data generated by our purchases.
 Bar coding has made checkout very convenient for us, and provides retail establishments with masses of data. Grocery stores and other retail stores are able to quickly process our purchases and use computers to accurately determine product prices. Information gathered through bar coding can be used for data mining analysis.
 Data mining has been heavily used in the medical field, including analysis of patient records to help identify best practices.
EXAMPLES OF DATA MINING

 Data mining is widely used by banking firms in soliciting credit card customers, and by insurance and telecommunication companies in detecting fraud.
 It is used by telephone companies and credit card issuers in identifying those potential customers most likely to churn.
 It is also used by manufacturing firms in quality control, and in many other applications.
EXAMPLES OF DATA MINING

 Data mining can be used by businesses in many ways. Three examples are:
 1. Customer profiling, identifying those subsets of customers most profitable to the
business;
 2. Targeting, determining the characteristics of profitable customers who have been
captured by competitors;
 3. Market-basket analysis, determining product purchases by consumer, which can be
used for product positioning and for cross-selling.
WHY DO WE NEED DATA MINING

 Fraud detection
 Potential clients
 Quality control
 Product positioning
 Cross-selling
WHAT IS NEEDED TO DO DATA MINING

 Data mining requires identification of a problem, along with collection of data that
can lead to better understanding, and computer models to provide statistical or other
means of analysis.
 Data mining tools need to be versatile, scalable, capable of accurately predicting
responses between actions and results, and capable of automatic implementation.
 Versatile refers to the ability of the tool to apply a wide variety of models. Scalability implies that if the tool works on a small data set, it should also work on larger data sets.
DATA MINING TECHNIQUES

 Data Mining can be achieved by

 Association,

 Classification,

 Clustering,

 Prediction,

 Sequential Patterns and

 Similar Time Sequences.


ASSOCIATION

 In association, the relationship of a particular item in a data transaction to other items in the same transaction is used to predict patterns.
 For example, if a customer purchases a laptop PC (X), then he or she also buys a mouse (Y) in 60% of
the cases. This pattern occurs in 5.6% of laptop PC purchases. An association rule in this situation can
be “X implies Y, where 60% is the confidence factor and 5.6% is the support factor.” When the
confidence factor and support factor are represented by linguistic variables “high” and “low,”
respectively, the association rule can be written in the fuzzy logic form, such as: “where the support
factor is low, X implies Y is high.” In the case of many qualitative variables, fuzzy association is a
necessary and promising technique in data mining.
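
As a rough illustration (using made-up transactions, not figures from the text), the support and confidence of a rule such as “laptop implies mouse” can be computed directly:

```python
# Minimal sketch: computing support and confidence for the rule "laptop => mouse"
# over a small, made-up list of transactions (illustrative values only).

transactions = [
    {"laptop", "mouse", "bag"},
    {"laptop", "mouse"},
    {"laptop"},
    {"phone", "charger"},
    {"laptop", "mouse", "charger"},
]

def rule_metrics(transactions, antecedent, consequent):
    n = len(transactions)
    both = sum(1 for t in transactions if antecedent in t and consequent in t)
    antecedent_only = sum(1 for t in transactions if antecedent in t)
    support = both / n                    # fraction of all transactions containing X and Y
    confidence = both / antecedent_only   # fraction of X-transactions that also contain Y
    return support, confidence

support, confidence = rule_metrics(transactions, "laptop", "mouse")
print(f"support = {support:.2f}, confidence = {confidence:.2f}")
```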
CLASSIFICATION

 These methods are intended for learning different functions that map each item of the selected data into one of a predefined set of classes. Given the set of predefined classes, a number of attributes, and a “learning (or training) set,” the classification methods can automatically predict the class of other, unclassified data. Two key research problems related to classification results are the evaluation of misclassification and prediction power. Mathematical techniques that are often used to construct classification methods are:
  Binary decision trees,
  Neural networks,
  Linear programming, and
  Statistics.
BINARY DECISION TREE

 By using decision trees, a tree induction model with a “Yes–No” format can be built to split data into different classes according to its attributes.
 How well a model fits the data can be measured by either statistical estimation or information entropy.
 Decision trees clearly lay out the problem so that all options can be challenged.
 They allow us to analyze fully the possible consequences of a decision.
 They provide a framework to quantify the values of outcomes and the probabilities of achieving them.
 However, the classification obtained from tree induction may not produce an optimal solution, and its prediction power may be limited.
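
A minimal sketch of tree induction, assuming scikit-learn is available and using its bundled Iris data in place of real business data:

```python
# Minimal sketch: fitting a "Yes-No" style tree classifier with scikit-learn
# (the Iris data set stands in for real business data).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0)  # limit depth to keep the tree readable
tree.fit(X_train, y_train)

print(export_text(tree))                        # the induced Yes-No splits
print("test accuracy:", tree.score(X_test, y_test))
```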
HOW TO CONSTRUCT A DECISION TREE

 [Diagram: a decision tree starting at the ‘root node’ (or ‘the root’), branching through ‘internal nodes’ (or just ‘nodes’), and ending in ‘leaf nodes’ (or ‘leaves’).]
DECISION TREE

 Internal nodes have arrows pointing to them.

 And they have arrows pointing away from them.

 Leaf nodes have arrows pointing to them, but there are no arrows pointing away from them.
NEURAL NETWORKS

 By using neural networks, a neural induction model can be built.


 In this approach, the attributes become the input layer of the neural network, while the classes associated with the data become the output layer. Between the input layer and the output layer there are a number of hidden layers that determine the accuracy of the classification.
 Although the neural induction model often yields better results in many cases of data mining, since the relationships frequently involve complex nonlinearities, implementing this method is difficult when there is a large set of attributes.
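
A minimal sketch of a neural induction model, assuming scikit-learn’s MLPClassifier is available, with the bundled Iris data standing in for business attributes and classes:

```python
# Minimal sketch: a neural induction model where attributes feed the input layer
# and classes come out of the output layer.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two hidden layers sit between the input and output layers; scaling helps convergence.
model = make_pipeline(StandardScaler(),
                      MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=2000, random_state=0))
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```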
HOW TO CONSTRUCT A NEURAL NETWORK

 [Diagram: a neural network with an input layer, hidden layers, and an output layer. Neurons are joined by weighted channels; each neuron performs most of the computations required by the network and applies an activation function. The actual output is compared with the desired output to obtain the error, and the network is trained through forward propagation and back propagation.]
LINEAR PROGRAMMING

 In linear programming approaches, the classification problem is viewed as a special form of linear program.

 Given a set of classes and a set of attribute variables, one can define a cutoff limit (or boundary) separating the
classes.
 Then each class is represented by a group of constraints with respect to a boundary in the linear program.

 The objective function in the linear programming model can minimize the overlapping rate across classes and
maximize the distance between classes.
 The linear programming approach results in an optimal classification.
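
A rough sketch of the idea, not the exact model described above: scipy’s linprog (assumed available) finds a boundary separating two made-up classes while minimising the total overlap across them:

```python
# Minimal sketch of LP-based discrimination: find a boundary w.x = b that separates
# two classes while minimising total violations (overlap across classes).
# Small synthetic 2-D data; illustrative only.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
A = rng.normal(loc=[0, 0], scale=0.5, size=(20, 2))   # class A points
B = rng.normal(loc=[2, 2], scale=0.5, size=(20, 2))   # class B points
nA, nB, d = len(A), len(B), 2

# Variables: z = [w (d), b (1), e (nA), f (nB)]; minimise the sum of violations e, f.
c = np.concatenate([np.zeros(d + 1), np.ones(nA + nB)])

# Class A constraints: w.x_i - b - e_i <= -1   (want w.x_i <= b - 1)
A_rows = np.hstack([A, -np.ones((nA, 1)), -np.eye(nA), np.zeros((nA, nB))])
# Class B constraints: -w.x_j + b - f_j <= -1  (want w.x_j >= b + 1)
B_rows = np.hstack([-B, np.ones((nB, 1)), np.zeros((nB, nA)), -np.eye(nB)])

A_ub = np.vstack([A_rows, B_rows])
b_ub = -np.ones(nA + nB)
bounds = [(None, None)] * (d + 1) + [(0, None)] * (nA + nB)   # w, b free; violations >= 0

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
w, b = res.x[:d], res.x[d]
print("boundary weights:", w, "cutoff:", b, "total violation:", res.fun)
```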
STATISTICAL

 However, the computation time required by linear programming may exceed that of statistical approaches.
 Various statistical methods, such as linear discriminant analysis, quadratic discriminant analysis, and logistic regression, are very popular and are commonly used in real business classifications.
 Even though statistical software has been developed to handle large amounts of data, statistical approaches have a disadvantage in efficiently separating multiclass problems, in which a one-versus-rest comparison (i.e., one class versus the rest of the classes) has to be adopted.
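
A minimal sketch of a statistical classifier, assuming scikit-learn is available, using linear discriminant analysis on the bundled Iris data:

```python
# Minimal sketch: linear discriminant analysis as a statistical classifier
# (the Iris data set stands in for real business data).
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis()
print("cross-validated accuracy:", cross_val_score(lda, X, y, cv=5).mean())
```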
CLUSTERING

 Cluster analysis takes ungrouped data and uses automatic techniques to put this data
into groups.
 Clustering is unsupervised, and does not require a learning set.
 It shares a common methodological ground with Classification.
 In other words, most of the mathematical models mentioned earlier with regard to Classification can be applied to Cluster Analysis as well.
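
A minimal sketch of unsupervised clustering, assuming scikit-learn is available; the data are synthetic blobs rather than real ungrouped business data:

```python
# Minimal sketch: clustering ungrouped data without a learning set, using k-means.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)   # labels ignored: unsupervised
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print("cluster sizes:", [int((kmeans.labels_ == k).sum()) for k in range(4)])
```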
PREDICTION

 Prediction analysis is related to regression techniques.


 The key idea of prediction analysis is to discover the relationship between the dependent variable and the independent variables, as well as the relationships among the independent variables (one versus another, one versus the rest, and so on).
 For example, if sales is an independent variable, then profit may be a dependent variable.
 By using historical data from both sales and profit, either linear or nonlinear regression techniques can
produce a fitted regression curve that can be used for profit prediction in the future.
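
A minimal sketch of prediction by regression, using made-up sales and profit figures and numpy’s least-squares line fit:

```python
# Minimal sketch: fitting a regression of profit on sales and using it for prediction
# (made-up historical figures).
import numpy as np

sales  = np.array([10, 20, 30, 40, 50, 60], dtype=float)   # independent variable
profit = np.array([ 2,  5,  7, 11, 12, 16], dtype=float)   # dependent variable

slope, intercept = np.polyfit(sales, profit, deg=1)         # fitted regression line
print("predicted profit at sales = 70:", slope * 70 + intercept)
```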
SEQUENTIAL PATTERN

 Sequential Pattern analysis seeks to find similar patterns in data transactions over a business period.

 These patterns can be used by business analysts to identify relationships among data.

 The mathematical models behind Sequential Patterns are logic rules, fuzzy logic, and so on.

 As an extension of Sequential Patterns, Similar Time Sequences are applied to discover sequences similar to a
known sequence over both past and current business periods.
 In the data mining stage, several similar sequences can be studied to identify future trends in transaction
development.
 This approach is useful in dealing with databases that have time-series characteristics.
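
A rough sketch of the similar-time-sequences idea (not a full sequence-mining algorithm): a sliding-window distance locates the stretch of a made-up series most similar to a known sequence:

```python
# Minimal sketch: finding the window of a time series most similar to a known sequence,
# using a sliding-window Euclidean distance (illustrative data only).
import numpy as np

series = np.array([3, 4, 6, 9, 7, 5, 3, 4, 6, 8, 7, 4, 2, 3])   # transaction history
known  = np.array([3, 4, 6, 9])                                  # known pattern to match

distances = [np.linalg.norm(series[i:i + len(known)] - known)
             for i in range(len(series) - len(known) + 1)]
best = int(np.argmin(distances))
print("most similar window starts at index", best, "with distance", distances[best])
```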
DATA MINING PROCESS

 In order to systematically conduct data mining analysis, a general process is usually followed.

 CRISP-DM is an industry-standard process consisting of a sequence of steps that are usually involved in a data mining study.
 SEMMA is specific to SAS. While not every step of either approach is needed in every analysis, each process provides good coverage of the steps needed, starting with data exploration and moving through data collection, data processing, analysis, the inferences drawn, and implementation.
CRISP-DM

 There is a Cross-Industry Standard Process for Data Mining (CRISP-DM) widely used by industry
members.
 This model consists of six phases intended as a cyclical process.

Business Understanding
 Business understanding includes determining business objectives, assessing the current situation,
establishing data mining goals, and developing a project plan.
CRISP-DM

Data Understanding
 Once business objectives and the project plan are established, data understanding considers data requirements.

 This step can include initial data collection, data description, data exploration, and the verification of data quality.

 Data exploration such as viewing summary statistics (which includes the visual display of categorical variables)
can occur at the end of this phase.
 Models such as cluster analysis can also be applied during this phase, with the intent of identifying patterns in the
data.
CRISP-DM

Data Preparation
 Once the data resources available are identified, they need to be selected, cleaned, built into the form
desired, and formatted.
 Data cleaning and data transformation in preparation for data modeling need to occur in this phase.
 Data exploration at a greater depth can be applied during this phase, and additional models utilized,
again providing the opportunity to see patterns based on business understanding.
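
A minimal sketch of typical cleaning and formatting steps in this phase, assuming pandas is available; the columns and values are made up for illustration:

```python
# Minimal sketch: selecting, cleaning, and formatting raw data in the preparation phase
# (hypothetical column names and values).
import pandas as pd

raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age": [34, None, None, 45, 29],
    "spend": ["120.5", "80", "80", "200", "95.25"],
})

clean = (raw.drop_duplicates()                                            # remove duplicate records
            .assign(spend=lambda d: d["spend"].astype(float),             # fix data types
                    age=lambda d: d["age"].fillna(d["age"].median())))    # fill missing values
print(clean)
```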
CRISP-DM PROCESS
CRISP-DM

Modeling
 Data mining software tools such as visualization (plotting data and establishing relationships) and
cluster analysis (to identify which variables go well together) are useful for initial analysis.
 Tools such as generalized rule induction can develop initial association rules.
 Once greater data understanding is gained (often through pattern recognition triggered by viewing
model output), more detailed models appropriate to the data type can be applied.
 The division of data into training and test sets is also needed for modeling.
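
A minimal sketch of that training/test division, assuming scikit-learn is available and using synthetic data:

```python
# Minimal sketch: dividing data into training and test sets for the modelling phase
# (75/25 split on synthetic data).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
print(len(X_train), "training rows,", len(X_test), "test rows")
```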
Evaluation
 Model results should be evaluated in the context of the business objectives established in the first phase (business understanding).

 This will lead to the identification of other needs (often through pattern recognition), frequently reverting to prior
phases of CRISP-DM.
 Gaining business understanding is an iterative procedure in data mining, where the results of various visualization,
statistical, and artificial intelligence tools show the user new relationships that provide a deeper understanding of
organizational operations.
CRISP-DM

Deployment
 Data mining can be used both to verify previously held hypotheses and for knowledge discovery (the identification of unexpected and useful relationships).
 Through the knowledge discovered in the earlier phases of the CRISP-DM process, sound models can be obtained
that may then be applied to business operations for many purposes, including prediction or identification of key
situations.
 These models need to be monitored for changes in operating conditions, because what might be true today may
not be true a year from now.
 If significant changes do occur, the model should be redone.

 It’s also wise to record the results of data mining projects so documented evidence is available for future studies.
CRISP-DM

 There is usually a great deal of backtracking between the phases.


 Additionally, experienced analysts may not need to apply each phase for every study.
 CRISP-DM provides a useful framework for data mining.
