Unit 3 Data Mining

The document discusses data mining including definitions, processes, methods, and challenges. Data mining aims to extract useful patterns from large amounts of data. The key steps in data mining are business understanding, data understanding, data preparation, model building, evaluation, and deployment.

Uploaded by

badaltanwarr

Data Mining

Why Data Mining: Some Reasons

• More intense competition at the global scale, driven by customers’ ever-changing needs and wants in an increasingly saturated marketplace.

• General recognition of the untapped value hidden in large data sources.

• Consolidation and integration of database records, which enables a single view of customers, vendors, transactions, etc.

• Consolidation of databases and other data repositories into a single location in the form of a data warehouse.

• The exponential increase in data processing and storage capabilities.

• Significant reduction in the cost of hardware and software for data storage and processing.
Data Mining – Definitions, Characteristics, and Benefits
• Simply defined, data mining is a term used to describe discovering or “mining”
knowledge from large amounts of data.

• Technically speaking, data mining is a process that uses statistical, mathematical, and
artificial intelligence techniques to extract and identify useful information and
subsequent knowledge (or patterns) from large sets of data.

• These patterns can be in the form of business rules, affinities, correlations, trends, or
prediction models.

• Most literature defines data mining as “the nontrivial process of identifying valid,
novel, potentially useful, and ultimately understandable patterns in data stored in
structured databases.” In this definition, the meanings of the key terms are as follows:

• Process implies that data mining comprises many iterative steps.

• Nontrivial means that some experimentation-type search or inference is involved;
that is, it is not as straightforward as a computation of predefined quantities.

• Valid means that the discovered patterns should hold true on new data with a
sufficient degree of certainty.

• Novel means that the patterns are not previously known to the user within the
context of the system being analyzed.

• Potentially useful means that the discovered patterns should lead to some benefit to
the user or task.

• Ultimately understandable means that the pattern should make business sense.

• Data mining is not a new discipline, but rather a new definition for the use of many
disciplines. Data mining is tightly positioned at the intersection of many
disciplines, including statistics, artificial intelligence, machine learning,
management science, information systems, and databases (see Figure).
How Data Mining Works

• In general, data mining seeks to identify four major types of patterns:

1. Associations find the commonly co-occurring groupings of things, such as beer
and diapers going together in market-basket analysis.
2. Predictions tell the nature of future occurrences of certain events based on what
has happened in the past, such as predicting the winner of the Super Bowl or
forecasting the absolute temperature of a particular day.
3. Clusters identify natural groupings of things based on their known characteristics,
such as assigning customers in different segments based on their demographics and
past purchase behaviors.
4. Sequential relationships discover time-ordered events, such as predicting that an
existing banking customer who already has a checking account will open a savings
account followed by an investment account within a year.

Generally speaking, data mining tasks can be classified into three main categories:
prediction, association, and clustering.
1. Prediction: Prediction is commonly referred to as the act of telling about the future. A
term that is commonly associated with prediction is forecasting.
• Whereas prediction is largely experience- and opinion-based, forecasting is data- and
model-based.
• Prediction can be named more specifically as classification (where the predicted thing,
such as tomorrow’s forecast, is a class label such as “rainy” or “sunny”) or regression
(where the predicted thing, such as tomorrow’s temperature, is a real number, such as
“65°F”).
a) Classification: Classification, or supervised induction, is perhaps the most common of
all data mining tasks. The objective of classification is to analyze the historical data
stored in a database and automatically generate a model that can predict future behavior.
• Common classification tools include neural networks and decision trees (from machine
learning), logistic regression and discriminant analysis (from traditional statistics), and
emerging tools such as rough sets, support vector machines, and genetic algorithms.
• Neural networks involve the development of mathematical structures (somewhat
resembling the biological neural networks in the human brain) that have the capability to
learn from past experiences presented in the form of well-structured data sets.
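
The classification objective described above (learn a model from labeled historical data, then predict the class of a new case) can be sketched with a one-level decision tree, often called a decision stump. This is only an illustrative sketch: the loan data, the function names `fit_stump` and `predict`, and the use of a single numeric attribute are all assumptions made for the example, not part of the original slides.

```python
def fit_stump(values, labels):
    """Pick the threshold on one numeric attribute that misclassifies
    the fewest historical records (a one-level decision tree)."""
    distinct = sorted(set(values))
    # Candidate thresholds: midpoints between consecutive distinct values.
    thresholds = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    classes = sorted(set(labels))
    best = None
    for t in thresholds:
        for below in classes:
            for above in classes:
                if below == above:
                    continue
                errors = sum(
                    (below if v <= t else above) != y
                    for v, y in zip(values, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, t, below, above)
    _, t, below, above = best
    return t, below, above


def predict(model, value):
    """Apply the learned rule to a new, unlabeled case."""
    t, below, above = model
    return below if value <= t else above


# Hypothetical historical data: applicant income (in thousands) and loan outcome.
incomes = [18, 22, 25, 31, 40, 47, 52, 60]
outcomes = ["default", "default", "default", "default",
            "repaid", "repaid", "repaid", "repaid"]

model = fit_stump(incomes, outcomes)
print(model)               # (35.5, 'default', 'repaid')
print(predict(model, 45))  # repaid
```

Real classification tools such as full decision trees or neural networks use many attributes at once; the stump only shows the "learn from history, predict the future" pattern in its simplest form.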
DATA MINING METHODS

Regression
• Regression is a statistical analysis technique used to examine the relationship between one or more independent variables (predictors) and a dependent variable (outcome).
• Examples:
  • estimating the probability that a patient will die given the results of a set of diagnostic tests;
  • predicting consumer demand for a new product as a function of advertising expenditure.
[Figure: A Simple Linear Regression for the Loan Data Set]
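
As a minimal sketch of simple linear regression with one predictor, the ordinary least-squares slope and intercept can be computed directly from the data. The advertising and demand figures below are made up for illustration, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising expenditure (predictor) vs. demand (outcome).
ad_spend = [1, 2, 3, 4, 5]
demand = [2.1, 3.9, 6.2, 7.8, 10.0]

slope, intercept = fit_line(ad_spend, demand)
forecast = slope * 6 + intercept  # predicted demand at an expenditure of 6
print(round(slope, 2), round(intercept, 2), round(forecast, 2))  # 1.97 0.09 11.91
```

The fitted line predicts a real number, which is what distinguishes regression from classification.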
How Data Mining Works
2. Associations: Discovering interesting relationships among variables in large
databases. Association rule mining is often called market-basket analysis.
• With link analysis, the linkage among many objects of interest is discovered
automatically, such as the link between Web pages and referential relationships
among groups of academic publication authors.
• With sequence mining, relationships are examined in terms of their order of
occurrence to identify associations over time.
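
To make market-basket analysis concrete, the sketch below scores the rule "beer → diapers" with support, confidence, and lift. These three measures are the standard way to rate association rules, but the slides do not name them, so treat them (and the five hypothetical transactions) as assumptions added for illustration.

```python
# Hypothetical market baskets; each set is one customer transaction.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "diapers"},
    {"beer", "chips"},
    {"milk", "bread"},
]


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent):
    """Of the baskets containing the antecedent, the share that also contain the consequent."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)


def lift(antecedent, consequent):
    """Confidence relative to the consequent's baseline frequency (>1 suggests a positive association)."""
    return confidence(antecedent, consequent) / support(consequent)


print(support({"beer", "diapers"}))                 # 0.4
print(round(confidence({"beer"}, {"diapers"}), 3))  # 0.667
print(round(lift({"beer"}, {"diapers"}), 3))        # 1.111
```

A lift slightly above 1 here means that, in this toy data, beer buyers are somewhat more likely than the average customer to buy diapers.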

3. Clustering: Clustering partitions a collection of things (e.g., objects, events, etc.,
presented in a structured data set) into segments (or natural groupings) whose
members share similar characteristics. Unlike in classification, in clustering the class
labels are unknown. As the selected algorithm goes through the data set, identifying
the commonalities of things based on their characteristics, the clusters are established.
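
The idea of establishing clusters without class labels can be sketched with k-means, one common clustering algorithm (the slides do not say which algorithm is meant, so this choice is an assumption). The example uses a single hypothetical attribute, customer spend, and a hypothetical helper `kmeans_1d` with hand-picked starting centroids.

```python
def kmeans_1d(points, centroids, max_iter=100):
    """Alternate between assigning each point to its nearest centroid and
    moving each centroid to the mean of its members, until nothing changes."""
    for _ in range(max_iter):
        labels = [
            min(range(len(centroids)), key=lambda j: abs(p - centroids[j]))
            for p in points
        ]
        new_centroids = []
        for j in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == j]
            # Keep the old centroid if a cluster ends up empty.
            new_centroids.append(sum(members) / len(members) if members else centroids[j])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical customer spend values; no class labels are given.
spend = [10, 12, 11, 50, 52, 49]
labels, centroids = kmeans_1d(spend, centroids=[10, 52])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The two segments (low spenders and high spenders) emerge from the data alone, which is the contrast with classification drawn above.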
Data Mining Process
Step 1: Business Understanding:
• This step starts with an understanding of the managerial need for new knowledge and an
explicit specification of the business objective.
• Specific goals such as “What are the common characteristics of the customers we have lost to
our competitors recently?” or “What are typical profiles of our customers, and how much
value does each of them provide to us?” are needed.
• Then a project plan for finding such knowledge is developed that specifies the people
responsible for collecting the data, analyzing the data, and reporting the findings.
• A budget to support the study should also be established.

Step 2: Data Understanding:

• To better understand the data, the analyst often uses a variety of statistical and graphical
techniques, such as simple statistical summaries of each variable (e.g., for numeric
variables the average, minimum/maximum, median, and standard deviation are among the
calculated measures, whereas for categorical variables the mode and frequency tables are
calculated), correlation analysis, scatterplots, histograms, and box plots.
• A careful identification and selection of data sources and the most relevant variables can
make it easier for data mining algorithms to quickly discover useful knowledge patterns.
• Data sources for data selection can vary; they include demographic data (such as income,
education, number of households, and age), sociographic data (such as hobby, club
membership, and entertainment), transactional data (sales record, credit card spending, issued
checks), and so on.
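
The simple statistical summaries mentioned for Step 2 can be sketched with Python's standard library. The income and education values are hypothetical, and the variable names are chosen only for the example.

```python
import statistics
from collections import Counter

# Hypothetical demographic data for five customers.
income = [32, 45, 45, 51, 77]                   # numeric variable (thousands)
education = ["BSc", "MSc", "BSc", "PhD", "BSc"]  # categorical variable

# Numeric variable: average, minimum/maximum, median, standard deviation.
numeric_summary = {
    "mean": statistics.mean(income),
    "min": min(income),
    "max": max(income),
    "median": statistics.median(income),
    "stdev": statistics.stdev(income),  # sample standard deviation
}

# Categorical variable: mode and frequency table.
categorical_summary = {
    "mode": statistics.mode(education),
    "frequencies": Counter(education),
}

print(numeric_summary)
print(categorical_summary)
```

Summaries like these are typically inspected before any modeling, alongside the plots listed above.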
Step 3: Data Preparation: The purpose of data preparation (more commonly called
data preprocessing) is to take the data identified in the previous step and prepare it for
analysis by data mining methods.
Step 4: Model Building: Various modeling techniques are selected and applied to an
already prepared data set in order to address the specific business need. The
model-building step also encompasses the assessment and comparative analysis of the
various models built.
• Depending on the business need, the data mining task can be of a prediction (either
classification or regression), an association, or a clustering type.

Step 5: Testing and Evaluation: The developed models are assessed and evaluated
for their accuracy and generality. This step assesses the degree to which the
selected model (or models) meets the business objectives and, if so, to what extent (i.e.,
do more models need to be developed and assessed).
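
One simple way to assess accuracy in Step 5 is to compare a model's predictions with the known outcomes of records that were held back from model building. The sketch below assumes a two-class problem; the labels, the held-out data, and the helper name `evaluate` are hypothetical.

```python
from collections import Counter


def evaluate(actual, predicted, positive):
    """Count confusion-matrix cells for a two-class problem and compute accuracy."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if a == positive:
            counts["TP" if p == positive else "FN"] += 1
        else:
            counts["FP" if p == positive else "TN"] += 1
    accuracy = (counts["TP"] + counts["TN"]) / len(actual)
    return dict(counts), accuracy


# Hypothetical held-out records: true outcomes vs. model predictions.
actual = ["yes", "no", "yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no", "no", "yes", "no", "yes", "yes", "no"]

cells, accuracy = evaluate(actual, predicted, positive="yes")
print(cells)     # {'TP': 3, 'TN': 3, 'FN': 1, 'FP': 1}
print(accuracy)  # 0.75
```

Measuring on held-out records rather than on the training data is what gives some evidence of generality, not just accuracy.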

Step 6: Deployment: Depending on the requirements, the deployment phase can be as
simple as generating a report or as complex as implementing a repeatable data mining
process across the enterprise.
• The deployment step may also include maintenance activities for the deployed
models.
DATA MINING PROBLEMS/ISSUES

• Limited Information
  ◦ Inconclusive data causes problems: if some attributes essential to knowledge about the
    application domain are not present in the data, it may be impossible to discover
    significant knowledge about a given domain.
  ◦ For example, one cannot diagnose malaria from a patient database if that database does
    not contain the patients’ red blood cell counts.

• Noise and Missing Values

Missing data can be treated by discovery systems in a number of ways, such as:
  ◦ Simply disregard missing values.
  ◦ Omit the corresponding records.
  ◦ Infer missing values from known values.
  ◦ Treat missing data as a special value to be included additionally in the attribute
    domain.
  ◦ Average over the missing values using Bayesian techniques.

• Uncertainty: refers to the severity of the error and the degree of noise in the
data.
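
Three of the missing-data treatments listed above can be sketched on a small record set: omitting records, inferring a value from known values, and using a special marker value. Filling with the mean is only one simple way to "infer from known values", and the records, field names, and the `"MISSING"` marker are hypothetical.

```python
# Hypothetical customer records; None marks a missing income value.
records = [
    {"age": 34, "income": 52.0},
    {"age": 41, "income": None},
    {"age": 29, "income": 38.0},
    {"age": 50, "income": None},
    {"age": 45, "income": 60.0},
]

# Treatment 1: omit the records that have a missing value.
complete = [r for r in records if r["income"] is not None]

# Treatment 2: infer missing values from known values (here, the mean of known incomes).
mean_income = sum(r["income"] for r in complete) / len(complete)
imputed = [
    {**r, "income": mean_income if r["income"] is None else r["income"]}
    for r in records
]

# Treatment 3: treat "missing" as a special value in the attribute domain.
flagged = [
    {**r, "income": "MISSING" if r["income"] is None else r["income"]}
    for r in records
]

print(len(complete))         # 3
print(imputed[1]["income"])  # 50.0
print(flagged[3]["income"])  # MISSING
```

Which treatment is appropriate depends on how much data is missing and why; omitting records is the simplest option but discards information.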
• Size, Updates, and Irrelevant Fields
  ◦ Because databases are large and their contents are continually updated, the problem
    from the data mining perspective is how to ensure that the rules are up-to-date and
    consistent with the most current information.
  ◦ The learning system also has to be time-sensitive, as some data values vary over time
    and the discovery system is affected by the ‘timeliness’ of the data.
POTENTIAL APPLICATIONS
Retailing and Logistics
• Predict accurate sales volumes at specific retail locations in order to determine correct
inventory levels.
• Identify sales relationships between different products (with market-basket analysis) to
improve the store layout and optimize sales promotions.
• Forecast consumption levels of different product types (based on seasonal and
environmental conditions) to optimize logistics and, hence, maximize sales.
• Discover interesting patterns in the movement of products (especially for the products that
have a limited shelf life because they are prone to expiration, perishability, and
contamination) in a supply chain by analyzing sensory and RFID data.

Banking
• Automate the loan application process by accurately predicting the most probable
defaulters.
• Detect fraudulent credit card and online-banking transactions.
• Identify ways to maximize customer value by selling customers the products and services
that they are most likely to buy.
• Optimize the cash return by accurately forecasting the cash flow on banking entities (e.g.,
ATMs, banking branches).
Insurance
• Forecast claim amounts for property and medical coverage costs for better business
planning.
• Determine optimal rate plans based on the analysis of claims and customer data.
• Predict which customers are more likely to buy new policies with special features.
• Identify and prevent incorrect claim payments and fraudulent activities.

Health Care
• Identify people without health insurance and the factors underlying this undesired
phenomenon.
• Identify novel cost-benefit relationships between different treatments to develop
more effective strategies.
• Forecast the level and the time of demand at different service locations to optimally
allocate organizational resources.
• Understand the underlying reasons for customer and employee attrition.
Entertainment Industry
• Analyze viewer data to decide what programs to show during prime time and how
to maximize returns by knowing where to insert advertisements.
• Predict the financial success of movies before they are produced to make
investment decisions and to optimize the returns.
• Forecast the demand at different locations and different times to better schedule
entertainment events and to optimally allocate resources.
• Develop optimal pricing policies to maximize revenues.
Q&A
• Define Data Mining.
• What recent factors have increased the popularity of data mining?
• Discuss the major characteristics and objectives of data mining.
• What are some major data mining methods and algorithms?
• Identify at least five specific applications of data mining and list five common
characteristics of these applications.
• What do you think is the most prominent application area for data mining? Why?
• What are the major data mining processes?
• Why do you think the early phases (understanding of the business and understanding
of the data) take the longest in data mining projects?
• Distinguish data mining from other analytical tools and techniques.
• Are data mining processes a mere sequential set of activities? Explain.
• What is the main difference between classification and clustering? Explain using
concrete examples.
• Preparation of data is the most crucial step in data mining. Critically examine
the role of data preparation (data consolidation, data cleaning, data
transformation, and data reduction) in forming the data.
