Unit 3 Data Mining
• Significant reduction in the cost of hardware and software for data storage and
processing.
Data Mining – Definitions, characteristics and Benefits
• Simply defined, data mining is a term used to describe discovering or “mining”
knowledge from large amounts of data.
• Technically speaking, data mining is a process that uses statistical, mathematical, and
artificial intelligence techniques to extract and identify useful information and
subsequent knowledge (or patterns) from large sets of data.
• These patterns can be in the form of business rules, affinities, correlations, trends, or
prediction models.
• Most literature defines data mining as “the nontrivial process of identifying valid,
novel, potentially useful, and ultimately understandable patterns in data stored in
structured databases.” In this definition, the meanings of the key terms are as follows:
• Valid means that the discovered patterns should hold true on new data with a
sufficient degree of certainty.
• Novel means that the patterns are not previously known to the user within the
context of the system being analyzed.
• Potentially useful means that the discovered patterns should lead to some benefit to
the user or task.
• Ultimately understandable means that the pattern should make business sense.
• Data mining is not a new discipline, but rather a new definition for the use of many
disciplines. Data mining is tightly positioned at the intersection of many
disciplines, including statistics, artificial intelligence, machine learning,
management science, information systems, and databases (see Figure).
How Data Mining Works
Generally speaking, data mining tasks can be classified into three main categories:
prediction, association, and clustering.
1. Prediction: Prediction is commonly referred to as the act of telling about the future. A
term that is commonly associated with prediction is forecasting.
• Whereas prediction is largely experience and opinion-based, forecasting is data and
model-based.
• Prediction can be named more specifically as classification (where the predicted thing,
such as tomorrow’s forecast, is a class label such as “rainy” or “sunny”) or regression
(where the predicted thing, such as tomorrow’s temperature, is a real number, such as
“65°F”).
a) Classification: Classification, or supervised induction, is perhaps the most common of
all data mining tasks. The objective of classification is to analyze the historical data
stored in a database and automatically generate a model that can predict future behavior.
• Common classification tools include neural networks and decision trees (from machine
learning), logistic regression and discriminant analysis (from traditional statistics), and
emerging tools such as rough sets, support vector machines, and genetic algorithms.
• Neural networks involve the development of mathematical structures (somewhat
resembling the biological neural networks in the human brain) that have the capability to
learn from past experiences presented in the form of well-structured data sets.
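To make the idea of learning a predictive model from historical data concrete, here is a minimal sketch of a one-level decision tree (a "decision stump"). The loan records, the income field, and the function names are hypothetical and serve only as an illustration; real classification tools are far more elaborate.

```python
# Hypothetical historical records: (annual income in $1000s, known outcome).
history = [
    (18, "default"), (22, "default"), (25, "default"), (31, "default"),
    (38, "repay"), (45, "repay"), (52, "repay"), (60, "repay"),
]

def train_stump(records):
    """Find the income threshold that misclassifies the fewest records.

    The learned rule is: income <= threshold -> low_label, else high_label.
    """
    labels = sorted({label for _, label in records})
    thresholds = sorted({x for x, _ in records})
    best = None
    for threshold in thresholds:
        for low_label in labels:
            for high_label in labels:
                errors = sum(
                    1 for x, label in records
                    if (low_label if x <= threshold else high_label) != label
                )
                if best is None or errors < best[0]:
                    best = (errors, threshold, low_label, high_label)
    return best[1:]  # drop the error count, keep the rule

def predict(model, x):
    """Apply the learned rule to a new, unlabeled case."""
    threshold, low_label, high_label = model
    return low_label if x <= threshold else high_label

model = train_stump(history)
new_applicants = [20, 55]
predictions = [predict(model, x) for x in new_applicants]
print(model, predictions)
```

The model is generated automatically from the labeled history and is then applied to cases whose outcome is not yet known, which is the supervised pattern described above.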
DATA MINING METHODS
Regression
• Regression is a statistical analysis technique used to examine the relationship between
one or more independent variables (predictors) and a dependent variable (outcome).
• For example:
• estimating the probability that a patient will die given the results of a set of
diagnostic tests;
• predicting consumer demand for a new product as a function of advertising expenditure.
[Figure: A Simple Linear Regression for the Loan Data Set]
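As an illustration of regression with a single predictor, the sketch below fits a straight line by ordinary least squares. The advertising and demand figures are made up for the example, and the function name fit_line is an assumption, not something taken from the slides.

```python
# Hypothetical data: advertising spend (in $1000s) and units sold.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
units_sold = [12.1, 13.9, 16.2, 17.8, 20.0]

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

intercept, slope = fit_line(ad_spend, units_sold)

# Predict demand for a spend level that was not in the data.
forecast = intercept + slope * 6.0
print(round(intercept, 2), round(slope, 2), round(forecast, 2))
```

Here the predictor is advertising spend and the outcome is demand; the fitted slope estimates how much demand changes per additional unit of spend.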
How Data Mining Works
2. Associations: Association mining discovers interesting relationships among variables in
large databases. Association rule mining is often called market-basket analysis.
• With link analysis, the linkage among many objects of interest is discovered
automatically, such as the link between Web pages and referential relationships
among groups of academic publication authors.
• With sequence mining, relationships are examined in terms of their order of
occurrence to identify associations over time.
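Market-basket analysis is usually quantified with support, confidence, and lift. These measures are not defined in the slides; the sketch below uses their standard definitions on a small, invented set of baskets.

```python
# Hypothetical shopping baskets, one set of items per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for basket in baskets if itemset <= basket) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Of the baskets containing the antecedent, the share that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

def lift(antecedent, consequent, baskets):
    """Confidence relative to how common the consequent is on its own."""
    return confidence(antecedent, consequent, baskets) / support(consequent, baskets)

rule_support = support({"diapers", "beer"}, transactions)
rule_confidence = confidence({"diapers"}, {"beer"}, transactions)
rule_lift = lift({"diapers"}, {"beer"}, transactions)
print(rule_support, rule_confidence, rule_lift)
```

A lift above 1 suggests the two items appear together more often than they would if they were bought independently.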
• Data sources for data selection can vary. They include demographic data (such as income,
education, number of households, and age), sociographic data (such as hobby, club
membership, and entertainment), transactional data (sales records, credit card spending,
issued checks), and so on.
Data Mining Process
Step 3: Data Preparation: The purpose of data preparation (more commonly called
data preprocessing) is to take the data identified in the previous step and prepare it for
analysis by data mining methods.
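Two common preprocessing tasks are filling in missing values (data cleaning) and rescaling numeric fields to a common range (data transformation). The sketch below shows one simple way to do each; the records, field names, and the choice of mean imputation and min-max scaling are assumptions made for illustration.

```python
# Hypothetical raw records with missing values marked as None.
raw = [
    {"age": 25, "income": 40000},
    {"age": None, "income": 52000},
    {"age": 47, "income": None},
    {"age": 36, "income": 88000},
]

def impute_mean(rows, field):
    """Data cleaning: replace missing values with the mean of the known ones."""
    known = [row[field] for row in rows if row[field] is not None]
    mean = sum(known) / len(known)
    return [{**row, field: mean if row[field] is None else row[field]} for row in rows]

def min_max_scale(rows, field):
    """Data transformation: rescale a numeric field to the range [0, 1]."""
    values = [row[field] for row in rows]
    low, high = min(values), max(values)
    span = (high - low) or 1.0  # avoid division by zero for constant fields
    return [{**row, field: (row[field] - low) / span} for row in rows]

prepared = raw
for field in ("age", "income"):
    prepared = impute_mean(prepared, field)
    prepared = min_max_scale(prepared, field)

for row in prepared:
    print(row)
```

After these steps every record is complete and both fields are on the same scale, so no single attribute dominates a distance-based or gradient-based method merely because of its units.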
Step 4: Model Building: Various modeling techniques are selected and applied to an
already prepared data set in order to address the specific business need. The
model-building step also encompasses the assessment and comparative analysis of the
various models built.
• Depending on the business need, the data mining task can be of a prediction (either
classification or regression), an association, or a clustering type.
Step 5: Testing and Evaluation: The developed models are assessed and evaluated
for their accuracy and generality. This step assesses whether the selected model (or
models) meets the business objectives and, if so, to what extent (i.e., whether more
models need to be developed and assessed).
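One common way to assess a classification model is to compare its predictions with the known outcomes on held-out data and summarize the result as accuracy, precision, and recall. The labels below are invented, and the choice of these three measures is an assumption; the slides do not prescribe specific metrics.

```python
# Hypothetical held-out cases: true outcomes and the model's predictions.
actual    = ["default", "repay", "repay", "default", "repay", "repay", "default", "repay"]
predicted = ["default", "repay", "default", "default", "repay", "repay", "repay", "repay"]

def confusion_counts(actual, predicted, positive):
    """Count true/false positives and negatives for one class of interest."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, guess in zip(actual, predicted):
        if guess == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts

counts = confusion_counts(actual, predicted, positive="default")

# Share of all cases predicted correctly.
accuracy = (counts["tp"] + counts["tn"]) / len(actual)
# Of the cases flagged as defaulters, the share that really defaulted.
precision = counts["tp"] / (counts["tp"] + counts["fp"])
# Of the real defaulters, the share the model caught.
recall = counts["tp"] / (counts["tp"] + counts["fn"])

print(counts, accuracy, precision, recall)
```

Evaluating on cases the model did not see during training is what speaks to generality; accuracy measured on the training data alone tends to be optimistic.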
• Limited Information
Inconclusive data causes problems: if some attributes essential to
knowledge about the application domain are not present in the data,
it may be impossible to discover significant knowledge about that domain.
For example, one cannot diagnose malaria from a patient database if that database does
not contain the patient’s red blood cell count.
• Uncertainty: refers to the severity of the error and the degree of noise in the
data.
DATA MINING PROBLEMS/ISSUES
• Size, Updates, and Irrelevant Fields
Databases are large and their contents change as records are added, modified,
or removed. The problem from the data mining perspective is how to
ensure that the rules are up-to-date and consistent with the
most current information.
Also, the learning system has to be time-sensitive, as some data
values vary over time and the discovery system is affected by the
‘timeliness’ of the data.
POTENTIAL APPLICATIONS
Retailing and Logistics
• predict accurate sales volumes at specific retail locations in order to determine correct
inventory levels;
• identify sales relationships between different products (with market-basket analysis) to
improve the store layout and optimize sales promotions;
• forecast consumption levels of different product types (based on seasonal and
environmental conditions) to optimize logistics and, hence, maximize sales;
• discover interesting patterns in the movement of products (especially for the products that
have a limited shelf life because they are prone to expiration, perishability, and
contamination) in a supply chain by analyzing sensory and RFID data.
Banking
• automating the loan application process by accurately predicting the most probable
defaulters;
• detecting fraudulent credit card and online-banking transactions;
• identifying ways to maximize customer value by selling them products and services that
they are most likely to buy;
• optimizing the cash return by accurately forecasting the cash flow on banking entities
(e.g., ATMs, banking branches).
Insurance
• forecast claim amounts for property and medical coverage costs for better business
planning;
• determine optimal rate plans based on the analysis of claims and customer data;
• predict which customers are more likely to buy new policies with special features;
• identify and prevent incorrect claim payments and fraudulent activities.
Health Care
• identify people without health insurance and the factors underlying this undesired
phenomenon;
• identify novel cost-benefit relationships between different treatments to develop
more effective strategies;
• forecast the level and the time of demand at different service locations to optimally
allocate organizational resources;
• understand the underlying reasons for customer and employee attrition.
Entertainment industry
• analyze viewer data to decide what programs to show during prime time and how
to maximize returns by knowing where to insert advertisements;
• predict the financial success of movies before they are produced to make
investment decisions and to optimize the returns;
• forecast the demand at different locations and different times to better schedule
entertainment events and to optimally allocate resources;
• develop optimal pricing policies to maximize revenues.
Q&A
• Define Data Mining.
• What recent factors have increased the popularity of data mining?
• Discuss the major characteristics and objectives of data mining.
• What are some major data mining methods and algorithms?
• Identify at least five specific applications of data mining and list five common
characteristics of these applications.
• What do you think is the most prominent application area for data mining? Why?
• What are the major data mining processes?
• Why do you think the early phases (understanding of the business and understanding
of the data) take the longest in data mining projects?
• Distinguish data mining from other analytical tools and techniques.
• Are data mining processes a mere sequential set of activities? Explain.
• What is the main difference between classification and clustering? Explain using
concrete examples.
• Preparation of data is the most crucial step in data mining. Critically examine
the role of data preparation (data consolidation, data cleaning, data
transformation, and data reduction) in forming the data.