Unit 3 Data Mining

The document discusses data mining including definitions, processes, methods, and challenges. Data mining aims to extract useful patterns from large amounts of data. The key steps in data mining are business understanding, data understanding, data preparation, model building, evaluation, and deployment.

Uploaded by

badaltanwarr
Copyright © All Rights Reserved
Data Mining

Why Data Mining: Some Reasons


• More intense competition at the global scale driven by customers’ ever-changing
needs and wants in an increasingly saturated marketplace.

• General recognition of the untapped value hidden in large data sources.

• Consolidation and integration of database records, which enables a single view
of customers, vendors, transactions, etc.

• Consolidation of databases and other data repositories into a single location in
the form of a data warehouse.

• The exponential increase in data processing and storage technologies.

• Significant reduction in the cost of hardware and software for data storage and
processing.
Data Mining – Definitions, Characteristics, and Benefits
• Simply defined, data mining is a term used to describe discovering or “mining”
knowledge from large amounts of data.

• Technically speaking, data mining is a process that uses statistical, mathematical, and
artificial intelligence techniques to extract and identify useful information and
subsequent knowledge (or patterns) from large sets of data.

• These patterns can be in the form of business rules, affinities, correlations, trends, or
prediction models.

• Most literature defines data mining as “the nontrivial process of identifying valid,
novel, potentially useful, and ultimately understandable patterns in data stored in
structured databases.” In this definition, the key terms have the following meanings:
• Process implies that data mining comprises many iterative steps.

• Nontrivial means that some experimentation-type search or inference is involved;
that is, it is not as straightforward as a computation of predefined quantities.

• Valid means that the discovered patterns should hold true on new data with a
sufficient degree of certainty.

• Novel means that the patterns are not previously known to the user within the
context of the system being analyzed.

• Potentially useful means that the discovered patterns should lead to some benefit to
the user or task.

• Ultimately understandable means that the pattern should make business sense.

• Data mining is not a new discipline, but rather a new definition for the use of many
disciplines. Data mining is tightly positioned at the intersection of many
disciplines, including statistics, artificial intelligence, machine learning,
management science, information systems, and databases (see Figure).
How Data Mining Works

• In general, data mining seeks to identify four major types of patterns:

1. Associations find the commonly co-occurring groupings of things, such as beer
and diapers going together in market-basket analysis.
2. Predictions tell the nature of future occurrences of certain events based on what
has happened in the past, such as predicting the winner of the Super Bowl or
forecasting the absolute temperature of a particular day.
3. Clusters identify natural groupings of things based on their known characteristics,
such as assigning customers in different segments based on their demographics and
past purchase behaviors.
4. Sequential relationships discover time-ordered events, such as predicting that an
existing banking customer who already has a checking account will open a savings
account followed by an investment account within a year.
Generally speaking, data mining tasks can be classified into three main categories:
prediction, association, and clustering. Sequential relationships are handled under
association, as sequence mining.
1. Prediction: Prediction is commonly referred to as the act of telling about the future. A
term that is commonly associated with prediction is forecasting.
• Whereas prediction is largely experience- and opinion-based, forecasting is data- and
model-based.
• Prediction can be named more specifically as classification (where the predicted thing,
such as tomorrow’s forecast, is a class label such as “rainy” or “sunny”) or regression
(where the predicted thing, such as tomorrow’s temperature, is a real number, such as
“65°F”).
a) Classification: Classification, or supervised induction, is perhaps the most common of
all data mining tasks. The objective of classification is to analyze the historical data
stored in a database and automatically generate a model that can predict future behavior.
• Common classification tools include neural networks and decision trees (from machine
learning), logistic regression and discriminant analysis (from traditional statistics), and
emerging tools such as rough sets, support vector machines, and genetic algorithms.
• Neural networks involve the development of mathematical structures (somewhat
resembling the biological neural networks in the human brain) that have the capability to
learn from past experiences presented in the form of well-structured data sets.
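
To make the classification idea concrete, here is a minimal sketch of the simplest possible decision tree: a single split (a "decision stump") learned from labeled historical records. The loan-style data, the field meaning, and the function names `fit_stump` and `predict` are all invented for illustration and are not taken from the slides.

```python
# Hypothetical historical records: (annual income in thousands, known outcome).
# The values are made up purely for illustration.
training = [(18, "default"), (22, "default"), (30, "default"),
            (41, "repay"), (55, "repay"), (70, "repay")]


def fit_stump(rows):
    """Learn a one-level decision tree: choose the threshold and label
    assignment that misclassify the fewest training rows."""
    label_a, label_b = sorted({y for _, y in rows})  # assumes exactly two classes
    values = sorted({x for x, _ in rows})
    # Candidate thresholds are midpoints between neighbouring values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for t in candidates:
        for low, high in ((label_a, label_b), (label_b, label_a)):
            errors = sum(1 for x, y in rows if (low if x <= t else high) != y)
            if best is None or errors < best[0]:
                best = (errors, t, low, high)
    _, threshold, low, high = best
    return threshold, low, high


def predict(model, x):
    """Apply the learned rule to a new, unlabeled record."""
    threshold, low, high = model
    return low if x <= threshold else high


model = fit_stump(training)
print(model)
print(predict(model, 25), predict(model, 60))
```

The model is built only from past records with known labels, which is what makes this a supervised method; real decision trees repeat this kind of split recursively over many variables.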
DATA MINING METHODS

Regression
• Regression is a statistical analysis technique used to examine the relationship
between one or more independent variables (predictors) and a dependent variable
(outcome).
• For example:
• estimating the probability that a patient will die given the results of a set of
diagnostic tests;
• predicting consumer demand for a new product as a function of advertising
expenditure.
[Figure: A Simple Linear Regression for the Loan Data Set]
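
As an illustration of the simplest case (one predictor, one outcome), the sketch below fits a straight line by ordinary least squares. The advertising and demand numbers are hypothetical and chosen to lie exactly on a line; the function name `fit_line` is likewise an assumption, not something from the slides.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising expenditure (predictor) and demand (outcome).
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
demand = [12.0, 14.0, 16.0, 18.0, 20.0]

slope, intercept = fit_line(ad_spend, demand)
forecast = intercept + slope * 6.0  # predicted demand at a new spend level
print(slope, intercept, forecast)
```

Note that the patient-mortality example above predicts a probability, which in practice is usually modeled with logistic rather than linear regression.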
How Data Mining Works
2. Associations: Association mining discovers interesting relationships among variables
in large databases. Association rule mining is often called market-basket analysis.
• With link analysis, the linkage among many objects of interest is discovered
automatically, such as the link between Web pages and referential relationships
among groups of academic publication authors.
• With sequence mining, relationships are examined in terms of their order of
occurrence to identify associations over time.
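
A small market-basket sketch may help here. It uses two standard association-rule measures that the slides do not define: support (the share of transactions containing an itemset) and confidence (how often the consequent appears when the antecedent does). The baskets are invented, and the function names are illustrative assumptions.

```python
from itertools import combinations

# Hypothetical transactions, one set of items per basket.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"beer", "chips"},
    {"diapers", "milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


# Enumerate all item pairs and keep those meeting a minimum support of 0.4.
items = sorted(set().union(*baskets))
frequent_pairs = [pair for pair in combinations(items, 2)
                  if support(pair, baskets) >= 0.4]

pair_support = support({"beer", "diapers"}, baskets)
rule_confidence = confidence({"beer"}, {"diapers"}, baskets)
print(frequent_pairs, pair_support, rule_confidence)
```

Enumerating every pair is only workable for tiny examples; practical algorithms prune the search, but the measures themselves are computed the same way.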

3. Clustering: Clustering partitions a collection of things (e.g., objects, events, etc.,
presented in a structured data set) into segments (or natural groupings) whose
members share similar characteristics. Unlike in classification, in clustering the class
labels are unknown. As the selected algorithm goes through the data set, identifying
the commonalities of things based on their characteristics, the clusters are established.
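
One common way to establish such groupings is k-means, shown below as a minimal sketch. The slides do not name a specific clustering algorithm, so k-means is a swapped-in example; the customer data (age, monthly spend), the choice of two clusters, and the fixed starting centroids are assumptions made to keep the run deterministic.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))


def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of its assigned points, and repeat."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [mean_point(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters


# Hypothetical customers: (age, monthly spend). No class labels are given.
customers = [(22, 30), (25, 35), (24, 28), (52, 90), (55, 95), (58, 88)]
centroids, clusters = kmeans(customers, centroids=[customers[0], customers[-1]])
print(centroids)
print(clusters)
```

Unlike the classification case, nothing in the input says which segment a customer belongs to; the two groups emerge only from similarity in the data.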
Data Mining Process
Step 1: Business Understanding:
• Understanding of the managerial need for new knowledge and an explicit specification of the
business objective.
• Specific goals such as “What are the common characteristics of the customers we have lost to
our competitors recently?” or “What are typical profiles of our customers, and how much
value does each of them provide to us?” are needed.
• Then a project plan for finding such knowledge is developed that specifies the people
responsible for collecting the data, analyzing the data, and reporting the findings.
• A budget to support the study should also be established.

Step 2: Data Understanding:
• To better understand the data, the analyst often uses a variety of statistical and graphical
techniques, such as simple statistical summaries of each variable (e.g., for numeric
variables the average, minimum/maximum, median, and standard deviation are among the
calculated measures, whereas for categorical variables the mode and frequency tables are
calculated), correlation analysis, scatterplots, histograms, and box plots.
• A careful identification and selection of data sources and the most relevant variables can
make it easier for data mining algorithms to quickly discover useful knowledge patterns.

• Data sources for data selection can vary and include demographic data (such as income,
education, number of households, and age), sociographic data (such as hobby, club
membership, and entertainment), transactional data (sales record, credit card spending, issued
checks), and so on.
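
The summary measures listed for this step can be computed with Python's standard library, as sketched below. The income and education values are hypothetical stand-ins for one numeric and one categorical variable.

```python
import statistics
from collections import Counter

# Hypothetical demographic sample (income in thousands, education level).
incomes = [32, 45, 45, 51, 60, 75, 120]
education = ["HS", "BSc", "BSc", "MSc", "HS", "BSc", "PhD"]

# Numeric variable: average, minimum/maximum, median, standard deviation.
numeric_summary = {
    "mean": statistics.mean(incomes),
    "min": min(incomes),
    "max": max(incomes),
    "median": statistics.median(incomes),
    "stdev": statistics.stdev(incomes),  # sample standard deviation
}

# Categorical variable: mode and frequency table.
education_mode = statistics.mode(education)
education_freq = Counter(education)

print(numeric_summary)
print(education_mode, dict(education_freq))
```

Here the mean sits well above the median because of the single large income, which is the kind of skew a histogram or box plot would also reveal.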
Step 3: Data Preparation: The purpose of data preparation (more commonly called
data preprocessing) is to take the data identified in the previous step and prepare it for
analysis by data mining methods.
Step 4: Model Building: Various modeling techniques are selected and applied to an
already prepared data set in order to address the specific business need. The
model-building step also encompasses the assessment and comparative analysis of the
various models built.
• Depending on the business need, the data mining task can be of a prediction (either
classification or regression), an association, or a clustering type.

Step 5: Testing and Evaluation: The developed models are assessed and evaluated
for their accuracy and generality. This step assesses the degree to which the
selected model (or models) meets the business objectives and, if so, to what extent (i.e.,
do more models need to be developed and assessed).
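
One simple way to assess accuracy and generality is to score a model on held-out records it did not see during model building. The sketch below assumes such a holdout set already exists; the labels and predictions are invented, and the confusion counts are an addition beyond what the slides describe.

```python
def accuracy(actual, predicted):
    """Share of held-out records whose predicted label matches the actual one."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


def confusion_counts(actual, predicted, positive):
    """Count true/false positives and negatives for one class of interest."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts


# Hypothetical holdout set: true outcomes and a model's predictions for them.
actual = ["repay", "repay", "default", "repay",
          "default", "default", "repay", "repay"]
predicted = ["repay", "default", "default", "repay",
             "default", "repay", "repay", "repay"]

holdout_accuracy = accuracy(actual, predicted)
counts = confusion_counts(actual, predicted, positive="default")
print(holdout_accuracy, counts)
```

Whether 75% accuracy is good enough depends on the business objective from Step 1; a missed defaulter may cost far more than a wrongly rejected applicant.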

Step 6: Deployment: Depending on the requirements, the deployment phase can be as
simple as generating a report or as complex as implementing a repeatable data mining
process across the enterprise.
• The deployment step may also include maintenance activities for the deployed
models.
DATA MINING PROBLEMS/ISSUES

• Limited Information
  ◦ Inconclusive data causes problems: if some attributes essential to knowledge about
    the application domain are not present in the data, it may be impossible to discover
    significant knowledge about a given domain.
  ◦ For example, one cannot diagnose malaria from a patient database if that database
    does not contain the patients’ red blood cell counts.

• Noise and Missing Values
  Missing data can be treated by discovery systems in a number of ways, such as:
  ◦ Simply disregard missing values.
  ◦ Omit the corresponding records.
  ◦ Infer missing values from known values.
  ◦ Treat missing data as a special value to be included additionally in the attribute
    domain.
  ◦ Average over the missing values using Bayesian techniques.

• Uncertainty: refers to the severity of the error and the degree of noise in the
data.
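
Three of the missing-value treatments listed above can be sketched in a few lines. The records, field names, and function names are hypothetical, and filling with the mean is only one simple way to "infer missing values from known values"; the Bayesian option is not shown.

```python
import statistics

# Hypothetical customer records; None marks a missing value.
records = [
    {"age": 34, "segment": "retail"},
    {"age": None, "segment": "corporate"},
    {"age": 51, "segment": None},
    {"age": 29, "segment": "retail"},
]


def omit_incomplete(rows):
    """Omit the corresponding records: drop any row with a missing field."""
    return [r for r in rows if all(v is not None for v in r.values())]


def impute_mean(rows, field):
    """Infer missing values from known values, here using the field's mean."""
    known = [r[field] for r in rows if r[field] is not None]
    fill = statistics.mean(known)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]


def flag_as_special(rows, field, marker="UNKNOWN"):
    """Treat missing data as an extra value in the attribute's domain."""
    return [{**r, field: marker if r[field] is None else r[field]} for r in rows]


complete_only = omit_incomplete(records)
with_ages = impute_mean(records, "age")
with_segments = flag_as_special(records, "segment")
print(len(complete_only), with_ages[1]["age"], with_segments[2]["segment"])
```

Each function returns new records and leaves the originals untouched, so the strategies can be compared side by side; omitting records is the simplest choice but discards half of this small sample.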
• Size, Updates, and Irrelevant Fields
  ◦ Databases tend to be large, and their contents change as records are added,
    modified, or removed. From the data mining perspective, the problem is how to
    ensure that the rules are up-to-date and consistent with the most current
    information.
  ◦ The learning system also has to be time-sensitive, because some data values vary
    over time and the discovery system is affected by the ‘timeliness’ of the data.
POTENTIAL APPLICATIONS
Retailing and Logistics
• predict accurate sales volumes at specific retail locations in order to determine correct
inventory levels;
• identify sales relationships between different products (with market-basket analysis) to
improve the store layout and optimize sales promotions;
• forecast consumption levels of different product types (based on seasonal and
environmental conditions) to optimize logistics and, hence, maximize sales;
• discover interesting patterns in the movement of products (especially for the products that
have a limited shelf life because they are prone to expiration, perishability, and
contamination) in a supply chain by analyzing sensory and RFID data.

Banking
• automating the loan application process by accurately predicting the most probable
defaulters;
• detecting fraudulent credit card and online-banking transactions;
• identifying ways to maximize customer value by selling them products and services that
they are most likely to buy;
• optimizing the cash return by accurately forecasting the cash flow on banking entities (e.g.,
ATMs, banking branches).
Insurance
• forecast claim amounts for property and medical coverage costs for better business
planning;
• determine optimal rate plans based on the analysis of claims and customer data;
• predict which customers are more likely to buy new policies with special features;
• identify and prevent incorrect claim payments and fraudulent activities.

Health Care
• identify people without health insurance and the factors underlying this undesired
phenomenon;
• identify novel cost-benefit relationships between different treatments to develop
more effective strategies;
• forecast the level and the time of demand at different service locations to optimally
allocate organizational resources;
• understand the underlying reasons for customer and employee attrition.
Entertainment industry
• analyze viewer data to decide what programs to show during prime time and how
to maximize returns by knowing where to insert advertisements;
• predict the financial success of movies before they are produced to make
investment decisions and to optimize the returns;
• forecast the demand at different locations and different times to better schedule
entertainment events and to optimally allocate resources;
• develop optimal pricing policies to maximize revenues.
Q&A
• Define Data Mining.
• What recent factors have increased the popularity of data mining?
• Discuss the major characteristics and objectives of data mining.
• What are some major data mining methods and algorithms?
• Identify at least five specific applications of data mining and list five common
characteristics of these applications.
• What do you think is the most prominent application area for data mining? Why?
• What are the major data mining processes?
• Why do you think the early phases (understanding of the business and understanding
of the data) take the longest in data mining projects?
• Distinguish data mining from other analytical tools and techniques.
• Are data mining processes a mere sequential set of activities? Explain.
• What is the main difference between classification and clustering? Explain using
concrete examples.
• Preparation of data is the most crucial step in data mining. Critically examine
the role of data preparation (data consolidation, data cleaning, data
transformation, and data reduction) in forming the data.
