02-Data_Mining_The_Data_Mining_Process
EL Moukhtar ZEMMOURI
ENSAM-Meknès
2023-2024
The process : U. Fayyad et al. 1996

Excerpt from U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, Communications of the ACM, November 1996 / Vol. 39, No. 11:

… multidimensional data analysis, which is superior to SQL (a standard data manipulation language) in computing summaries and breakdowns along many dimensions. While current OLAP tools target interactive data analysis, we expect they will also include more automated discovery components in the near future.
Fields concerned with inferring models from data—including statistical pattern recognition, applied statistics, machine learning, and neural networks—were the impetus for much early KDD work. KDD largely relies on methods from these fields to find patterns from data in the data mining step of the KDD process. A natural question is: How is KDD different from these other fields? KDD focuses on the overall process of knowledge discovery from data, including how the data is stored and accessed, how algorithms can be scaled to massive datasets and still run efficiently, how results can be interpreted and visualized, and how the overall human-machine interaction can be modeled and supported. …
… long enough in any dataset (even randomly generated data), one can find patterns that appear to be statistically significant but in fact are not. This issue is of fundamental importance to KDD. There has been substantial progress in understanding such issues in statistics in recent years, much directly relevant to KDD. Thus, data mining is a legitimate activity as long as one understands how to do it correctly. KDD can also be viewed as encompassing a broader view of modeling than statistics, aiming to provide tools to automate (to the degree possible) the entire process of data analysis, including the statistician's art of hypothesis selection.

The KDD Process
Here we present our (necessarily subjective) perspective of a unifying process-centric framework for KDD. The goal is to provide an overview of the variety of activities …

Footnote from the excerpt: See Providing OLAP to User Analysts: An IT Mandate by E.F. Codd and Associates (1993).

• KDD process steps :
• Preprocessing :
1. Learning the application domain.
2. Creating a target dataset.
…
• Postprocessing :
8. Interpretation.
9. Using discovered knowledge.
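The excerpt above makes a concrete statistical point: if one searches long enough, even purely random data yields patterns that look significant. Below is a minimal Python sketch of this multiple-comparisons effect (using NumPy and SciPy; the data sizes and the 0.05 threshold are illustrative assumptions, not from the paper):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Purely random data: 1000 candidate "features" and one random target,
# 50 observations each. No real relationship exists by construction.
n_obs, n_features = 50, 1000
X = rng.normal(size=(n_obs, n_features))
y = rng.normal(size=n_obs)

# Test every feature against the target and keep the p-values.
p_values = np.array([stats.pearsonr(X[:, j], y)[1] for j in range(n_features)])

print("smallest p-value:", round(p_values.min(), 4))
print("features 'significant' at the 0.05 level:", int((p_values < 0.05).sum()))
# With 1000 tests, roughly 50 features look "significant" purely by chance.
# This is the pitfall the KDD process must guard against, e.g. via holdout
# validation or multiple-testing corrections.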
The process : C. Aggarwal 2015

[Figure: the overall data mining process (Aggarwal, Fig. 1.1):
DATA COLLECTION → PREPROCESSING (feature extraction, cleaning and integration)
→ ANALYTICAL PROCESSING (building block 1 → building block 2) → OUTPUT FOR ANALYST,
plus FEEDBACK (OPTIONAL) loops between the stages.]
From C. Aggarwal, Data Mining: The Textbook (2015):

… possible to directly use a standard data mining problem, such as the four "superproblems" discussed earlier, for the application at hand. However, these four problems have such wide coverage that many applications can be broken up into components that use these different building blocks. This book will provide examples of this process.
The overall data mining process is illustrated in Fig. 1.1. Note that the analytical block in Fig. 1.1 shows multiple building blocks representing the design of the solution to a particular application. This part of the algorithmic design is dependent on the skill of the analyst and often uses one or more of the four major problems as a building block. This is, of course, not always the case, but it is frequent enough to merit special treatment of these four problems within this book. To explain the data mining process, we will use an example from a recommendation scenario.
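To make the block structure of Fig. 1.1 concrete, here is a minimal Python sketch in which each stage is a function. The toy records, the cleaning rule, and the two analytical building blocks (access counting, then per-customer ranking) are illustrative assumptions, not code from the book.

def collect(sources):
    # Data collection: gather raw records from one or more sources.
    return [record for source in sources for record in source]

def preprocess(records):
    # Feature extraction + cleaning/integration: keep records that parsed correctly.
    return [r for r in records if r.get("product") is not None]

def building_block_1(records):
    # Analytical building block 1: count accesses per (customer, product).
    counts = {}
    for r in records:
        key = (r["customer"], r["product"])
        counts[key] = counts.get(key, 0) + 1
    return counts

def building_block_2(counts):
    # Analytical building block 2: rank products per customer by access count.
    ranked = {}
    for (customer, product), n in counts.items():
        ranked.setdefault(customer, []).append((n, product))
    return {c: [p for _, p in sorted(v, reverse=True)] for c, v in ranked.items()}

web_logs = [
    {"customer": "u1", "product": "A"},
    {"customer": "u1", "product": "A"},
    {"customer": "u1", "product": "B"},
    {"customer": "u2", "product": None},   # malformed entry, dropped in preprocessing
]
output_for_analyst = building_block_2(building_block_1(preprocess(collect([web_logs]))))
print(output_for_analyst)   # {'u1': ['A', 'B']}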
The process : C. Aggarwal 2015

• A typical data mining application contains the following phases :
1. Data collection
2. Data preprocessing → make data suitable for processing
a. Feature extraction
b. Data cleaning
c. Feature selection and transformation
3. Analytical processing and algorithms

Example 1.2.1. Consider a scenario in which a retailer has Web logs corresponding to customer accesses to Web pages at his or her site. Each of these Web pages corresponds to a product, and therefore a customer access to a page may often be indicative of interest in that particular product. The retailer also stores demographic profiles for the different customers. The retailer wants to make targeted product recommendations to customers using the customer demographics and buying behavior.

Sample Solution Pipeline. In this case, the first step for the analyst is to collect the relevant data from two different sources. The first source is the set of Web logs at the site. The second is the demographic information within the retailer database that was collected during Web registration of the customer. Unfortunately, these data sets are in a very different format and cannot easily be used together for processing. For example, consider a sample log entry of the following form:

98.206.207.157 - - [31/Jul/2013:18:09:38 -0700] "GET /productA.htm HTTP/1.1" 200 328177 "-" "Mozilla/5.0 (Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10B329 Safari/8536.25" "retailer.net"

The log may contain hundreds of thousands of such entries. Here, a customer at IP address 98.206.207.157 has accessed productA.htm. The customer from the IP address can be identified using the previous login information, by using cookies, or by the IP address itself, but this may be a noisy process and may not always yield accurate results. The analyst would need to design algorithms for deciding how to filter the different log entries and use only those which provide accurate results as a part of the cleaning and extraction process. Furthermore, the raw log contains a lot of additional information that is not necessarily …
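As a sketch of the feature-extraction and cleaning step described above, the following Python snippet pulls the client IP and the requested page out of the sample log entry and discards entries that are malformed or not successful page views. The regular expression and the choice of fields are illustrative assumptions, not the book's code.

import re

# The sample log entry from Example 1.2.1 (combined log format plus a trailing host field).
entry = ('98.206.207.157 - - [31/Jul/2013:18:09:38 -0700] "GET /productA.htm HTTP/1.1" '
         '200 328177 "-" "Mozilla/5.0 (Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) '
         'Version/6.0 Mobile/10B329 Safari/8536.25" "retailer.net"')

# Capture the fields used as features: client IP, timestamp, requested page, status, size.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<page>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)

def extract_features(line):
    # Return an (ip, page) feature pair, or None if the line is malformed
    # or not a successful page view (part of the cleaning step).
    m = LOG_PATTERN.match(line)
    if m is None or m.group("status") != "200":
        return None
    return m.group("ip"), m.group("page")

print(extract_features(entry))   # ('98.206.207.157', '/productA.htm')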
The process : CRISP-DM
• CRISP-DM
• CRoss-Industry Standard Process For Data Mining
• “An industry-proven way to guide your data mining efforts.”
• http://i2t.icesi.edu.co/ASUM-DM_External/index.htm
CRISP-DM : data mining life cycle
• Business understanding
• Understanding the project objectives and requirements from a
business perspective.
• Converting this knowledge into a data mining problem
definition and a preliminary plan designed to achieve the
objectives.
• Data understanding
• Starts with an initial data collection.
• Get familiar with the data.
• Identify data quality problems.
• Discover first insights into the data.
• Modeling
• Various modeling techniques are selected and applied, and
their parameters are calibrated to optimal values.
• Typically, there are several techniques for the same data mining
problem type. Some techniques have specific requirements on
the form of data.
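As an illustration of the Modeling phase described above (a sketch using scikit-learn on a built-in toy dataset, not part of CRISP-DM itself), one technique is selected and its parameters are calibrated by cross-validated grid search:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Candidate parameter values to calibrate for the chosen technique.
param_grid = {"max_depth": [2, 3, 5, None], "min_samples_leaf": [1, 5, 10]}

search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("cross-validated accuracy:", round(search.best_score_, 3))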
CRISP-DM : data mining life cycle
• Evaluation
• You have built a model that appears to have high quality from
a data analysis perspective!
• Evaluate the model and review the steps executed to be certain
it achieves the business objectives.
• Deployment
• Creation of the model is not the end of the project. The
knowledge gained will need to be organized and presented in a
way that the customer can use it.
• It often involves applying live models within an organization’s
decision-making processes.
• The deployment phase can be as simple as generating a report
or as complex as implementing a repeatable data mining
process across the enterprise.
Multidimensional Data
• Definition :
• Records are also called : data point, instance, example, transaction, entity, tuple, object, or feature-vector.
• Types of attributes
• Examples : product names, state (high, low / good, bad), weight / length, localization
Types of attributes
• Defines the levels of measurement
• Possible attribute types :
• Qualitative :
• Nominal
• Ordinal
• Quantitative :
• Numeric / Interval
Nominal quantities
• No relation is implied among nominal values
• → no ordering or distance measure
• Only equality tests can be performed
• Values are distinct symbols
• Values themselves serve only as labels or names
• Examples:
• Attribute : country, values : Morocco, Algeria, Tunisia, …
• Attribute : color, values : red, green, blue
• Attribute : gender, values : male, female
Attribute types in practice
• Example : real, integer, …
Multidimensional Data
• Types of attributes

Classification of data types : Nominal, ordinal and quantitative

• N – Nominal (labels)
• Fruits : apples, oranges, …
• Operations : = , !=
• O – Ordinal (ordered)
• Quality of meat : Grade A, AA, AAA
• Operations : = , != , < , > , <= , >=
• Q – Interval (location of zero arbitrary)
• Dates : Jan 5, 2012 ; location : (LAT 47, LONG 122)
• Like a geometric point. Cannot compare directly. Only differences (i.e. intervals) may be compared.
• Operations : = , != , < , > , <= , >= , - , + , mean
• Can measure distances
• Q – Ratio (zero fixed)
• Physical measurement : length, mass, …
• Counts and amounts
• Like a geometric vector, origin is meaningful
• Operations : = , != , < , > , <= , >= , - , + , / , mean
• Can measure ratios or proportions

[S. S. Stevens, On the theory of scales of measurement, 1946]
(Slide adapted from Cecilia Aragon, HCDE, UW, May 2013)
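A small pandas sketch of the four levels of measurement, showing which operations are meaningful for each. The column names and values are invented for the illustration.

import pandas as pd

df = pd.DataFrame({
    "fruit":   pd.Categorical(["apple", "orange", "apple"]),                 # nominal: only = / !=
    "grade":   pd.Categorical(["A", "AAA", "AA"],
                              categories=["A", "AA", "AAA"], ordered=True),  # ordinal: also < / >
    "date":    pd.to_datetime(["2012-01-05", "2013-07-31", "2012-03-01"]),   # interval: differences ok
    "mass_kg": [1.2, 0.3, 0.8],                                              # ratio: ratios/proportions ok
})

print(df["fruit"] == "apple")               # nominal: equality test only
print(df["grade"] > "A")                    # ordinal: ordering is defined
print(df["date"].max() - df["date"].min())  # interval: differences are meaningful
print(df["mass_kg"] / df["mass_kg"].sum())  # ratio: proportions are meaningful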
Example
• Titanic Dataset
Quiz
• Give an appropriate type for each of the following attributes :
• Student ID
• Department (GIP, GM, GC, …)
• Annual salary
• Marital status
• Number of children
• Rating (bad, medium, good) or stars
• Supervised learning :
• Right answers are given in a training dataset
• All input data is labeled, and the algorithms learn to predict the output from the input data.
• Unsupervised learning:
• Input dataset is not labeled
• All input data is unlabeled, and the algorithms learn the inherent structure from the input
data.
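A minimal scikit-learn sketch of the contrast (the toy points and the chosen models are illustrative assumptions): a classifier is trained on labeled data, while a clustering algorithm finds structure in the same points without labels.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.2], [0.9, 1.0], [1.1, 0.8],    # one group of points
              [5.0, 5.1], [5.2, 4.9], [4.8, 5.0]])   # another group of points
y = np.array([0, 0, 0, 1, 1, 1])                     # labels = the "right answers"

# Supervised: the labels y are given, and the model learns to predict them.
clf = LogisticRegression().fit(X, y)
print(clf.predict([[1.0, 1.0], [5.0, 5.0]]))         # -> [0 1]

# Unsupervised: no labels; the algorithm discovers the structure (2 clusters) itself.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)                                    # two groups, with arbitrary cluster ids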
Supervised vs. Unsupervised
• Regression
• …
• Associations
• …
Quiz
Of the following examples, which would you address using an
unsupervised learning algorithm?
Quiz