CHAPTER 3: DATA MINING FOR BUSINESS INTELLIGENCE
Definition of Data Mining

• Data mining is "the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data stored in structured databases."

• Other names: knowledge extraction, pattern analysis, knowledge discovery, information harvesting, pattern searching, and data dredging.
Definition of Data Mining

• In this definition, the meanings of the key terms are as follows:

• Process implies that data mining comprises many iterative steps.

• Nontrivial means that some experimentation-type search or inference is involved; that is, it is not as straightforward as a computation of predefined quantities.

• Valid means that the discovered patterns should hold true on new data with a sufficient degree of certainty.
Definition of Data Mining

• Novel means that the patterns are not previously known to the user within the context of the system being analyzed.

• Potentially useful means that the discovered patterns should lead to some benefit to the user or task.

• Ultimately understandable means that the pattern should make business sense that leads to the user saying, "Mmm! It makes sense; why didn't I think of that?"
Definition of Data Mining

■ Data mining is not a new discipline, but rather a new use of many existing disciplines.

■ Data mining is tightly positioned at the intersection of many disciplines, including statistics, artificial intelligence, machine learning, management science, information systems, and databases (see next figure).
How Data Mining Works
■ Using existing and relevant data, data mining builds models to identify patterns among
the attributes presented in the data set.
■ Models are the mathematical representations that identify the patterns among the
attributes of the objects (e.g., customers) described in the data set.
■ Some of these patterns are explanatory (explaining the interrelationships and affinities
among the attributes), whereas others are predictive (foretelling future values of certain
attributes).

■ Types of patterns
1. Association
2. Prediction
3. Cluster (segmentation)
4. Sequential (or time series) relationships
Data Mining Applications

■ Customer Relationship Management

– Maximize return on marketing campaigns

– Improve customer retention (churn analysis)

– Maximize customer value (cross-selling, upselling)

– Identify and treat most valued customers


Data Mining Applications
■ Banking
– Automating the loan application process by predicting the most probable defaulters

– Detecting fraudulent credit card and online-banking transactions

– Maximizing customer value (cross-selling, upselling)

– Optimizing the cash return by forecasting the cash flow on banking entities (e.g., ATMs, banking branches)
Data Mining Applications
■ Manufacturing and Production:

– Predict/prevent machinery failures

– Identify anomalies in production systems to optimize the use of manufacturing capacity

– Discover novel patterns to improve product quality

Data Mining Applications
■ Medicine
(1) identify novel patterns to improve the survivability of patients with cancer;

(2) predict success rates of organ transplantation patients to develop better donor-organ matching policies;

(3) identify the functions of different genes in the human chromosome (known as genomics);

(4) discover the relationships between symptoms and illnesses to make informed and correct decisions in a timely manner.
Why Data Mining?
■ More intense competition at the global scale driven by customers' ever-changing
needs and wants

■ Recognition of the value in data sources

■ Availability of quality data on customers, vendors, transactions, Web, etc.

■ Consolidation of databases and other data repositories into a single location in the
form of a data warehouse.

■ The exponential increase in data processing and storage capabilities.

■ Significant reduction in the cost of hardware and software for data storage and
processing.
Data Mining Applications
■ Entertainment industry
– analyze viewer data to decide what programs to show during prime
time and how to maximize returns by knowing where to insert
advertisements;
– predict the financial success of movies before they are produced to
make investment decisions and to optimize the returns;
– forecast the demand at different locations and different times to
better schedule entertainment events and to optimally allocate
resources
Data Mining Applications
■ Sports.
■ Healthcare.
■ Insurance.
■ Travel industry
■ Government and defense.
■ Brokerage and securities trading.
■ Retailing and logistics.
Characteristics and Objectives of DM
The following are the major characteristics and objectives
of data mining:
▪ Source of data for DM is often (but not always) a consolidated
data warehouse
▪ DM environment is usually a client-server or a Web-based
information systems architecture
▪ Data is the most critical ingredient for DM, which may include
soft/unstructured data.
Characteristics and Objectives of DM

▪ The miner is often an end user

▪ Striking it rich requires creative thinking.

▪ Data mining tools are readily combined with other software
development tools. Thus, the mined data can be analyzed
and deployed quickly and easily.
DATA MINING PROCESS
Most common standard processes:
• CRISP-DM (Cross-Industry Standard Process for Data
Mining)
• SEMMA (Sample, Explore, Modify, Model, and Assess)
• KDD (Knowledge Discovery in Databases)
CRISP-DM
■ Step 1: Business Understanding:
• The key element of any data mining study is to know what the
study is for.
• Then a project plan for finding such knowledge is developed
that specifies the people responsible for collecting the data,
analyzing the data, and reporting the findings.
• At this early stage, a budget to support the study should also be
established.
Step 2: Data Understanding:
■ First, the analyst should be clear and concise about the description of the data
mining task so that the most relevant data can be identified.

■ Furthermore, the analyst should understand:

– the data sources (e.g., where the relevant data are stored and in what form;
whether the process of collecting the data is automated or manual; who
the collectors of the data are and how often the data are updated)

– the variables (e.g., What are the most relevant variables? Are the variables
independent of each other, i.e., do they stand as a complete information source
without overlapping or conflicting information?)
Step 2: Data Understanding:
■ Data sources for data selection can vary. Normally, data sources for
business applications include:
– demographic data (such as income, education, number of
households, and age),
– sociographic data (such as hobby, club membership, and
entertainment),
– transactional data (sales records, credit card spending, issued
checks), and so on.
Step 2: Data Understanding:
■ Data can be categorized as quantitative and qualitative.

■ Quantitative data is measured using numeric values. It can be discrete (such as integers) or continuous (such as real numbers).

■ Qualitative data, also known as categorical data, contains both nominal and ordinal data.

– Nominal data has finite nonordered values (e.g., gender data, which has two values: male and female).

– Ordinal data has finite ordered values. For example, customer credit ratings are considered ordinal data because the ratings can be excellent, fair, and bad.
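To make the distinction concrete, the following minimal Python/pandas sketch (with made-up values) encodes one nominal and one ordinal variable; only the ordinal one carries an ordering that supports comparisons:

import pandas as pd

# Nominal: finite, unordered categories (no ranking implied)
gender = pd.Series(["male", "female", "female"], dtype="category")

# Ordinal: finite, ordered categories (bad < fair < excellent)
rating_type = pd.CategoricalDtype(categories=["bad", "fair", "excellent"], ordered=True)
credit_rating = pd.Series(["excellent", "fair", "bad"], dtype=rating_type)

print(gender.cat.ordered)         # False -- nominal
print(credit_rating.cat.ordered)  # True  -- ordinal
print(credit_rating.min())        # 'bad', because the ordering enables comparisons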
Step 3: Data Preparation / Data Preprocessing
- The goal is to take the data identified in the previous step and
prepare it for analysis by data mining methods.
- Compared to the other steps, data preprocessing consumes the most
time and effort. (Why?)
- Real-world data is typically:
- incomplete (lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data)
- noisy (containing errors or outliers)
- inconsistent (containing discrepancies in codes or names)
Step 3: Data Preparation Stages
Step 3: Data Preparation
■ Data preparation involves four main steps:
1. Data Consolidation

■ In the first phase of data preprocessing, the relevant data is collected from the identified sources (accomplished in the previous step),

■ the necessary records and variables are selected (based on an intimate understanding of the data, the unnecessary sections are filtered out), and the records coming from multiple data sources are integrated.
Step 3: Data Preparation

2. Data Cleaning (data scrubbing):

■ In the second phase of data preprocessing, the data is cleaned.
■ In this step, the missing values in the data set are identified and dealt with.
■ In some cases, missing values are an anomaly in the data set, in which case they need to be imputed (filled with a most probable value) or ignored;
■ in other cases, the missing values are a natural part of the data set (e.g., the household income field is often left unanswered by people who are in the top income tier).
Step 3: Data Preparation
2. Data Cleaning:
■ In this step, the analyst should also identify noisy values in the data (i.e., the outliers) and smooth them out.

■ Additionally, inconsistencies (unusual values within a variable) in the data should be handled using domain knowledge and/or expert opinion.
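As an illustration only, the short pandas sketch below imputes a missing value and caps an outlier with a simple interquartile-range rule; the data frame, column names, and thresholds are hypothetical:

import numpy as np
import pandas as pd

# Made-up customer records: one missing age, one suspiciously large income
df = pd.DataFrame({
    "age": [34.0, 41.0, np.nan, 29.0, 38.0],
    "income": [52_000.0, 61_000.0, 58_000.0, 47_000.0, 1_200_000.0],
})

# Impute the missing value with a most probable value (here: the median)
df["age"] = df["age"].fillna(df["age"].median())

# Smooth out noisy values: cap anything beyond 1.5 * IQR above the third quartile
q1, q3 = df["income"].quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
df.loc[df["income"] > upper_fence, "income"] = upper_fence

print(df)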
Step 3: Data Preparation
3. Data Transformation:
■ In the third phase, the data is transformed for better processing.

■ In many cases the data is normalized between a certain minimum and maximum for all variables in order to mitigate the potential bias of one variable (having large numeric values, such as for household income) dominating other variables having smaller values.
Step 3: Data Preparation
3. Data Transformation
■ Another transformation is discretization and/or aggregation. In some cases, the numeric variables are converted to categorical values (e.g., low, medium, high);

■ in other cases one might choose to create new variables based on the existing ones to magnify the information found in the variables in the data set.
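A minimal Python sketch of both transformations, using a small made-up data frame, might look like this (min-max normalization followed by discretization into low/medium/high bands):

import pandas as pd

df = pd.DataFrame({"household_income": [32_000, 54_000, 71_000, 120_000],
                   "age": [23, 35, 47, 62]})

# Min-max normalization: rescale every variable to [0, 1] so that a large-valued
# variable (income) does not dominate a smaller-valued one (age)
normalized = (df - df.min()) / (df.max() - df.min())

# Discretization: convert a numeric variable into categorical bands
df["income_band"] = pd.cut(df["household_income"], bins=3,
                           labels=["low", "medium", "high"])

print(normalized)
print(df)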
Step 3: Data Preparation
4. Data Reduction:
■ Even though data miners like to have large data sets, too much data is also a problem.

■ In the simplest sense, data can be visualized in data mining projects as a flat file consisting of two dimensions:

– variables (the number of columns)

– cases/records (the number of rows)

Step 3: Data Preparation
4. Data Reduction:
■ In some cases, the number of variables can be rather large, and the analyst must reduce the number to a manageable size.

■ Some data sets may include millions of records. Even though computing power is increasing exponentially, processing that many records may not be practical or feasible. In such cases, one may need to sample a subset of the data for analysis.

■ The analyst should be careful in selecting a subset of the data that reflects the essence of the complete data set and is not specific to a subgroup or subcategory.
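For illustration, the short pandas sketch below draws a 10 percent sample; the file name transactions.csv and the churn column are hypothetical stand-ins for a real data set, and the stratified version preserves subgroup proportions so the sample reflects the whole:

import pandas as pd

df = pd.read_csv("transactions.csv")  # assumed source file

# Simple random sample of 10% of the records
sample = df.sample(frac=0.10, random_state=42)

# Stratified sample: keep the proportion of each class (e.g., churners vs. non-churners)
stratified = df.groupby("churn").sample(frac=0.10, random_state=42)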
Step 4: Model Building
■ In this step, various modeling techniques are selected and applied to a prepared data set in order to address the specific business need.

■ The model-building step also encompasses the assessment and comparative analysis of the various models built. (Why?)

■ Some methods may have specific requirements on the way that the data is to be formatted; thus, stepping back to the data preparation step is often necessary.

■ Depending on the business need, the data mining task can be of a classification, association, or clustering type.
Step 5: Testing and Evaluation
■ In step 5, the developed models are assessed and evaluated for their accuracy.

■ This step assesses the degree to which the selected model (or models) meets the
business objectives and, if so, to what extent.

■ Another option is to test the developed model(s) in a real-world scenario if time and
budget constraints permit.

■ This step is a critical and challenging task. No value is added by the data mining task
until the business value obtained from discovered patterns is identified and
recognized.
Step 5: Testing and Evaluation
■ The success of this step depends on the interaction among data analysts,
business analysts, and decision makers (such as business managers).

■ Because data analysts may not have a full understanding of the data mining
objectives and what they mean to the business, and the business analysts and
decision makers may not have the technical knowledge to interpret the results
of sophisticated mathematical solutions, interaction among them is necessary.
Step 6: Deployment
■ Depending on the requirements, the deployment phase can be as simple as
generating a report or as complex as implementing a repeatable data mining
process across the enterprise.

■ The deployment step may also include maintenance activities for the deployed
models.

■ Over time, the models (and the patterns embedded within them) built on the
old data may become obsolete, irrelevant, or misleading.
SEMMA
■ SEMMA = "sample, explore, modify, model, and assess."
■ Sample = begin with a statistically representative sample of the data.
■ Explore = apply exploratory statistical and visualization techniques.
■ Modify = select and transform the most significant predictive variables.
■ Model = model the variables to predict outcomes.
■ Assess = confirm a model's accuracy.
CRISP-DM VS SEMMA

■ CRISP-DM takes a more comprehensive approach to data mining projects, including understanding of the business and the relevant data, whereas SEMMA assumes that the data mining project's goals and data sources have already been identified and understood.
KDD Process
■ Knowledge discovery in databases (KDD) is a process of using data mining
methods to find useful information and patterns in the data.
■ KDD is a comprehensive process that encompasses data mining.
■ The input to the KDD process is organizational data. The enterprise data
warehouse (EDW) enables KDD to be implemented because it provides a
single source for data to be mined.
■ Dunham (2003) summarized the KDD process as consisting of the following
steps:
– data selection
– data preprocessing
– data transformation
– data mining
– interpretation/evaluation
KDD Process
Data Mining Methods
■ Classification
■ The most frequently used data mining method for real-world problems.

■ A member of the machine-learning family of techniques.

■ Learns from past data to classify new data.

– E.g., using classification to predict whether the weather on a particular day will be
"sunny," "rainy," or "cloudy."

– Popular classification tasks include credit approval (i.e., good or bad credit risk)

– and store location (e.g., good, moderate, bad).

GENERAL APPROACH TO BUILD A CLASSIFICATION MODEL
Data Mining Methods
■ Classification
■ The most common two-step methodology of classification-type
prediction involves:
– model training and,
– model testing.
■ In the model development phase, a collection of input data, including
the actual class labels, is used.
■ After a model has been trained, the model is tested against the
holdout sample for accuracy assessment,
■ and eventually deployed for actual use, where it predicts the classes of
new data instances (whose class labels are unknown).
Estimating the True Accuracy of Classification Models

Confusion Matrix (or Classification Matrix)

■ A confusion matrix is a table that is often used to describe the performance of a
classification model (or "classifier") on a set of test data for which the true
values are known.
■ The next figure represents a confusion matrix for a binary classifier.
■ Binary: the class takes only two values, i.e., true or false.
■ When the classification problem is not binary, the confusion matrix gets bigger.
Confusion Matrix for a binary classifier
Confusion Matrix
■ true positives (TP): These are cases in which we predicted yes (the products are
useful), and actually they are useful.
■ true negatives (TN): We predicted no, and they are not useful.
■ false positives (FP): We predicted yes, but they aren't actually useful.
■ false negatives (FN): We predicted no, but they are actually useful.
■ This is a list of rates that are often computed from a confusion matrix for a binary
classifier:
■ Accuracy: Overall, how often is the classifier correct?
– (TP+TN)/total
■ Misclassification Rate: Overall, how often is it wrong?
– (FP+FN)/total
– equivalent to 1 minus Accuracy
Confusion Matrix
■ True Positive Rate: When it's actually yes, how often does it predict yes?
– TP/actual yes
– also known as "Recall"
■ False Positive Rate: When it's actually no, how often does it predict yes?
– FP/actual no
■ True Negative Rate: When it's actually no, how often does it predict no?
– TN/actual no
– equivalent to 1 minus False Positive Rate
■ Precision: When it predicts yes, how often is it correct?
– TP/predicted yes
An Example of Confusion Matrix
■ Accuracy:
– (TP+TN)/total = (100+50)/165 = 0.91
■ Misclassification Rate:
– (FP+FN)/total = (10+5)/165 = 0.09
■ True Positive Rate:
– TP/actual yes = 100/105 = 0.95
■ False Positive Rate:
– FP/actual no = 10/60 = 0.17
■ True Negative Rate:
– TN/actual no = 50/60 = 0.83
■ Precision:
– TP/predicted yes = 100/110 = 0.91
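The figures above can be reproduced with a few lines of Python, using the counts from the worked example (TP = 100, TN = 50, FP = 10, FN = 5):

TP, TN, FP, FN = 100, 50, 10, 5
total = TP + TN + FP + FN                # 165 test cases

accuracy            = (TP + TN) / total  # 0.91
misclassification   = (FP + FN) / total  # 0.09
true_positive_rate  = TP / (TP + FN)     # recall, 0.95
false_positive_rate = FP / (FP + TN)     # 0.17
true_negative_rate  = TN / (FP + TN)     # 0.83
precision           = TP / (TP + FP)     # 0.91

print(round(accuracy, 2), round(true_positive_rate, 2), round(precision, 2))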
Estimation Methodologies for Classification
■ Simple split (or test sample estimation):
■ Dividing the data into two mutually exclusive subsets called:
– a training set and
– a test set (or holdout set).
■ It is common to designate two-thirds of the data as the training set and the remaining one-third
as the test set.
■ The training set is used by the inducer (model builder), and the built classifier is then tested on
the test set.
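A minimal scikit-learn sketch of the simple split is shown below; the built-in Iris data set stands in for a prepared business data set, and a decision tree serves as the inducer:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)  # stand-in for a prepared data set

# Two mutually exclusive subsets: two-thirds for training, one-third held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=42)

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)  # the inducer
print("Holdout accuracy:", accuracy_score(y_test, model.predict(X_test)))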
Estimation Methodologies for Classification
■ k-Fold Cross-Validation
■ A given data set is split into k sections/folds, where each fold is used as a testing
set at some point.
■ For example, consider 5-fold cross-validation (k = 5).
■ Here, the data set is split into 5 folds.
■ In the first iteration, the first fold is used to test the model and the rest are used
to train the model.
■ In the second iteration, the 2nd fold is used as the testing set while the rest serve as
the training set.
■ This process is repeated until each of the 5 folds has been used as the
testing set.
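A short scikit-learn sketch of 5-fold cross-validation, again using a built-in data set as a stand-in, might look like this:

from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Split the data into 5 folds; each fold serves once as the testing set
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=kfold)

print("Per-fold accuracy:", scores.round(2))
print("Mean accuracy:", round(scores.mean(), 2))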
K-Fold Cross Validation
Classification Techniques: Decision Tree
■ A decision tree takes the form of a tree structure. It breaks down a data set into smaller and
smaller subsets while at the same time an associated decision tree is gradually
developed.

■ A tree can be "learned" by splitting the source set into subsets based on an attribute
value test (inputs). This process is repeated on each derived subset in a recursive
manner called "recursive partitioning."

■ The basic idea is to ask questions whose answers would provide the most
information.

■ Decision trees are one of the most popular machine learning algorithms, and they can
handle both categorical and numerical data.
Decision Tree
The tree has three types of nodes:

■ A root node that has no incoming edges and zero or more outgoing edges.

■ Internal nodes, each of which has exactly one incoming edge and two or more outgoing edges.

■ Leaf or terminal nodes, each of which has exactly one incoming edge and no outgoing edges.
A decision tree for the mammal classification problem
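As a rough illustration (the tiny animal table below is made up and deliberately simplified), scikit-learn can learn such a tree by recursive partitioning and print its root, internal, and leaf nodes:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data: warm-blooded, live-bearing animals are labeled mammals; the last row
# is a cold-blooded live-bearer (e.g., some snakes), so a single split is not enough
data = pd.DataFrame({
    "gives_birth":  [1, 0, 1, 0, 1, 1],
    "warm_blooded": [1, 1, 1, 0, 1, 0],
    "mammal":       [1, 0, 1, 0, 1, 0],
})

tree = DecisionTreeClassifier(random_state=42).fit(
    data[["gives_birth", "warm_blooded"]], data["mammal"])

# The printed rules show the root question, an internal node, and the leaves
print(export_text(tree, feature_names=["gives_birth", "warm_blooded"]))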
Cluster Analysis for Data Mining

■ Cluster analysis or clustering is the process of partitioning a set of data objects (or observations) into subsets.

■ Each subset is a cluster, such that objects in a cluster are similar to one another, yet dissimilar to objects in other clusters.

■ Different clustering methods may generate different clusterings on the same data set.

■ The partitioning is not performed by humans, but by the clustering algorithm.

Cluster Analysis for Data Mining
■ The method is commonly used in biology, medicine, genetics, social network
analysis, anthropology, archaeology.

■ Cluster analysis has been used extensively for fraud detection and market
segmentation of customers in CRM systems.

■ Clustering has also found many applications in Web search.

■ Clustering can also be used for outlier detection, where outliers (values that are
“far away” from any cluster) may be more interesting than common cases.
Cluster Analysis for Data Mining
■ Applications of outlier detection include the detection of credit card fraud and
the monitoring of criminal activities in e-commerce.

– E.g., unusual cases in credit card transactions, such as very expensive and
infrequent purchases, may be of interest as possible fraudulent activities.

■ Clustering is known as unsupervised learning because the class label information is not present. For this reason, clustering is a form of learning by observation, rather than learning by examples.

■ Most cluster analysis methods involve the use of a distance measure to calculate the closeness between pairs of items.
ANALYSIS METHODS
■ Cluster analysis may be based on one or more of the following general
methods:
– Statistical methods, such as k-means, k-modes, and so on
– Neural networks
– Fuzzy logic
– Genetic algorithms
■ Each of these methods generally works with one of two general approaches:
– Divisive: With the divisive approach, all items start in one cluster and are
broken apart.
– Agglomerative: With the agglomerative approach, all items start in individual
clusters, and the clusters are joined together.
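As an illustrative sketch only, the following Python code applies k-means (one of the statistical methods listed above) to synthetic data standing in for customer attributes:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import MinMaxScaler

# Synthetic two-attribute "customer" data in place of a real CRM data set
X, _ = make_blobs(n_samples=200, centers=3, n_features=2, random_state=42)
X = MinMaxScaler().fit_transform(X)   # distance measures work best on normalized data

# k-means partitions the observations into k clusters around learned centers
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

print("Cluster sizes:", [int((kmeans.labels_ == k).sum()) for k in range(3)])
print("Cluster centers:")
print(kmeans.cluster_centers_)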
Classification Vs. Clustering
■ Classification uses predefined classes to which objects are assigned, while
clustering identifies similarities between objects, which it groups according to the
characteristics they have in common and that differentiate them from other groups of objects.

■ Clustering is framed in unsupervised learning; that is, for this type of algorithm we only
have a set of input data (not labelled), about which we must obtain information, without
previously knowing what the output will be.

■ On the other hand, classification belongs to supervised learning, which means that we
know the input data (labelled in this case) and we know the possible output of the
algorithm.
Association Rule Mining
■ Association rule mining (also known as affinity analysis or market-basket
analysis) is a popular data mining method.
■ It is part of the machine-learning family of techniques.
■ It aims to find interesting relationships (affinities) between variables (items)
in large databases.
■ The input to market-basket analysis is simple point-of-sale transaction data,
where the products and/or services purchased together are tabulated under a
single transaction instance.
■ The outcome of the analysis is invaluable information.
Association Rule Mining

A business can take advantage of such knowledge by:

(1) putting the items next to each other to make it more convenient for the
customers to pick them up together and not forget to buy one when buying the
others (increasing sales volume);

(2) promoting the items as a package (do not put one on sale if the other(s) are on
sale)

(3) placing them apart from each other so that the customer has to walk the aisles
to search for them, and by doing so potentially sees and buys other items.
Association Rule Mining
Are all association rules interesting and useful?
■ A generic rule: X → Y [S%, C%]
■ X, Y: products and/or services
■ X: left-hand side (LHS), or antecedent
■ Y: right-hand side (RHS), or consequent
■ S: support: how often X and Y go together
■ C: confidence: how often Y goes together with X
■ Example: {Laptop Computer, Antivirus Software} → {Extended Service
Plan} [30%, 70%]
Association Rule Mining
■ Support refers to the percentage of baskets where the rule was true (both the left-
and right-side products were present).

■ Confidence measures how often the products on the RHS appear in baskets that
also contain the products on the LHS.

■ Lift measures how much more often the LHS and RHS occur together than would
be expected if they were independent (i.e., the ratio of confidence to the overall
frequency of the RHS).
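These three measures can be computed directly from transaction data. The following Python sketch uses a handful of made-up baskets and a hypothetical rule {laptop, antivirus} → {service plan}:

# Hypothetical market baskets (each set is one point-of-sale transaction)
baskets = [
    {"laptop", "antivirus", "service plan"},
    {"laptop", "antivirus"},
    {"laptop", "service plan"},
    {"antivirus"},
    {"laptop", "antivirus", "service plan"},
]

X = {"laptop", "antivirus"}   # antecedent (LHS)
Y = {"service plan"}          # consequent (RHS)

n = len(baskets)
n_x  = sum(1 for b in baskets if X <= b)          # baskets containing X
n_y  = sum(1 for b in baskets if Y <= b)          # baskets containing Y
n_xy = sum(1 for b in baskets if (X | Y) <= b)    # baskets containing both

support    = n_xy / n                 # how often X and Y go together
confidence = n_xy / n_x               # how often Y appears given X
lift       = confidence / (n_y / n)   # > 1: co-occur more often than expected

print(f"support={support:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")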
Apriori Algorithm

• Finds subsets that are common to at least a minimum number of the itemsets
• Uses a bottom-up approach:
– frequent subsets are extended one item at a time (the size of frequent subsets increases from one-item subsets to two-item subsets, then three-item subsets, and so on), and
– groups of candidates at each level are tested against the data for minimum support
– see the figure…
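A compact, illustrative Python implementation of this bottom-up search (a simplified sketch rather than an optimized Apriori, run on made-up baskets) might look like this:

from itertools import combinations

def frequent_itemsets(baskets, min_support=0.4):
    # Extend frequent itemsets one item at a time, keeping only candidates
    # whose support meets the minimum threshold (bottom-up search)
    n = len(baskets)
    level = [frozenset([item]) for item in sorted({i for b in baskets for i in b})]
    frequent = {}
    while level:
        # Test this level's candidates against the data for minimum support
        supports = {c: sum(1 for b in baskets if c <= b) / n for c in level}
        survivors = {c: s for c, s in supports.items() if s >= min_support}
        frequent.update(survivors)
        # Build the next level by joining surviving itemsets that differ by one item
        level = list({a | b for a, b in combinations(survivors, 2)
                      if len(a | b) == len(a) + 1})
    return frequent

baskets = [{"milk", "bread"}, {"milk", "bread", "eggs"},
           {"bread", "eggs"}, {"milk", "eggs"}, {"milk", "bread", "eggs"}]
for itemset, support in frequent_itemsets(baskets).items():
    print(sorted(itemset), round(support, 2))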
Apriori Algorithm
Data Mining Myths
■ Myth: Data mining provides instant, crystal-ball-like predictions.
  Reality: Data mining is a multistep process that requires thoughtful, proactive design and use.

■ Myth: Data mining is not yet viable for business applications.
  Reality: The current state of the art is ready to go for almost any business.

■ Myth: Data mining requires a separate, devoted database.
  Reality: Because of advances in database technology, a devoted database is not required, even though it may be desirable.

■ Myth: Only those with advanced degrees can do data mining.
  Reality: Newer Web-based tools enable managers of all educational levels to do data mining.

■ Myth: Data mining is only for large firms that have lots of customer data.
  Reality: If the data accurately reflect the business or its customers, a company can use data mining.
Common Data Mining Mistakes
■ Selecting the wrong problem for data mining

■ Leaving insufficient time for data preparation

■ Looking only at aggregated results and not at individual records

■ Being sloppy about keeping track of the data mining procedure and results

■ Ignoring suspicious (good or bad) findings and quickly moving on

■ Running mining algorithms repeatedly and blindly, without thinking about the
next stage
