
02-Data_Mining_The_Data_Mining_Process

The document discusses the Knowledge Discovery and Data Mining (KDD) process, defining it as a nontrivial process of identifying valid and useful patterns in data. It outlines the steps involved in KDD, including data selection, preprocessing, mining, and interpretation, and emphasizes the importance of understanding data types and attributes in the mining process. Additionally, it introduces the CRISP-DM methodology as a structured approach to guide data mining projects.

Uploaded by

Ahmed Ajebli

Knowledge Discovery and Data Mining

The Data Mining Process

EL Moukhtar ZEMMOURI
ENSAM-Meknès
2023-2024

Knowledge Discovery and Data Mining


• Knowledge Discovery from Data (KDD) since 1989
• Definition : KDD process (U. Fayyad et al. 1996):
• “The nontrivial process of identifying valid, novel, potentially useful, and
ultimately understandable patterns in data”.
• KDD refers to the overall process of discovering useful knowledge from data.
• Data Mining refers to a particular step in this process.
• Data mining is the application of specific algorithms for extracting patterns
from data.
• Today, the terms KDD and Data Mining are often used interchangeably.
E. Zemmouri

• Data Mining is a complex and multistage process.


The process : U. Fayyad et al. 1996

• Overview of the steps constituting the KDD process
• The KDD process is interactive and iterative

Figure 1. Overview of the steps constituting the KDD process
Data → [Selection] → Target Data → [Preprocessing] → Preprocessed Data → [Transformation] → Transformed Data → [Data Mining] → Patterns → [Interpretation/Evaluation] → Knowledge

Excerpt shown behind the slide (Communications of the ACM, November 1996/Vol. 39, No. 11, p. 29). The page is cropped, so both columns start and end mid-sentence.

Left column:
[…] machine learning, pattern recognition, statistics, artificial intelligence and reasoning with uncertainty, knowledge acquisition for expert systems, data visualization, machine discovery [7], scientific discovery, information retrieval, and high-performance computing. KDD software systems incorporate theories, algorithms, and methods from all of these fields.
Database theories and tools provide the necessary infrastructure to store, access, and manipulate data. Data warehousing, a recently popularized term, refers to the current business trend of collecting and cleaning transactional data to make them available for online analysis and decision support. A popular approach for analysis of data warehouses is called online analytical processing (OLAP).[1] OLAP tools focus on providing multidimensional data analysis, which is superior to SQL (a standard data manipulation language) in computing summaries and breakdowns along many dimensions. While current OLAP tools target interactive data analysis, we expect they will also include more automated discovery components in the near future.
Fields concerned with inferring models from data, including statistical pattern recognition, applied statistics, machine learning, and neural networks, were the impetus for much early KDD work. KDD largely relies on methods from these fields to find patterns from data in the data mining step of the KDD process. A natural question is: How is KDD different from these other fields? KDD focuses on the overall process of knowledge discovery from data, including how the data is stored and accessed, how algorithms can be scaled to massive datasets and still run efficiently, how results can be interpreted and visualized, and how the overall human-machine interaction can be modeled and sup- […]
[1] See Providing OLAP to User Analysts: An IT Mandate by E.F. Codd and Associates (1993).

Right column:
[…] modeling algorithms for large noisy datasets are also of fundamental interest.
Statistics has much in common with KDD. Inference of knowledge from data has a fundamental statistical component (see [2] and the article by Glymour on statistical inference in this special section for more detailed discussions of the relationship between KDD and statistics). Statistics provides a language and framework for quantifying the uncertainty resulting when one tries to infer general patterns from a particular sample of an overall population. As mentioned earlier, the term data mining has had negative connotations in statistics since the 1960s, when computer-based data analysis techniques were first introduced. The concern arose over the fact that if one searches long enough in any dataset (even randomly generated data), one can find patterns that appear to be statistically significant but in fact are not. This issue is of fundamental importance to KDD. There has been substantial progress in understanding such issues in statistics in recent years, much directly relevant to KDD. Thus, data mining is a legitimate activity as long as one understands how to do it correctly. KDD can also be viewed as encompassing a broader view of modeling than statistics, aiming to provide tools to automate (to the degree possible) the entire process of data analysis, including the statistician’s art of hypothesis selection.
The KDD Process
Here we present our (necessarily subjective) perspective of a unifying process-centric framework for KDD. The goal is to provide an overview of the variety of activ- […]

The process : U. Fayyad et al. 1996

• KDD process steps :
• Preprocessing
1. Learning the application domain.
2. Creating a target dataset.
3. Data cleaning and preprocessing.
4. Data reduction and projection.
• Data Mining
5. Choosing the function (task) of data mining.
6. Choosing the data mining algorithm(s).
7. Data mining.
• Postprocessing
8. Interpretation.
9. Using discovered knowledge.

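
As a rough illustration only, the stages of the KDD process (selection, preprocessing, transformation, data mining, interpretation/evaluation) can be chained as plain functions. The toy records, the field names and the trivial "total quantity per product" mining step are hypothetical choices made for this sketch; they do not come from Fayyad et al.

```python
from collections import Counter

# Hypothetical raw data: (customer, product, quantity); None marks a missing value.
RAW = [
    ("ann", "tea", 2), ("bob", "coffee", None), ("ann", "tea", 1),
    ("cid", "TEA ", 3), ("bob", "milk", 1), ("ann", "coffee", 1),
]

def select(data):
    """Selection: keep only the fields relevant to the goal (target data)."""
    return [(product, qty) for _, product, qty in data]

def preprocess(target):
    """Preprocessing: drop records with missing values."""
    return [(p, q) for p, q in target if q is not None]

def transform(clean):
    """Transformation: normalise labels so equal products compare equal."""
    return [(p.strip().lower(), q) for p, q in clean]

def mine(transformed):
    """Data mining: a deliberately trivial pattern, total quantity per product."""
    totals = Counter()
    for p, q in transformed:
        totals[p] += q
    return totals

def interpret(patterns, min_total=3):
    """Interpretation/evaluation: keep only the patterns judged useful."""
    return {p: t for p, t in patterns.items() if t >= min_total}

knowledge = interpret(mine(transform(preprocess(select(RAW)))))
print(knowledge)
```

In a real project the chain is rarely run once from left to right: the result of `interpret` typically sends the analyst back to an earlier function, which is the interactive and iterative nature of the process noted in the slides.
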
The process : C. Aggarwal 2015

• The workflow of a typical data mining application

Figure 1.1: The data processing pipeline
Data collection → Data preprocessing (feature extraction; cleaning and integration) → Analytical processing (building block 1, building block 2) → Output for analyst, with optional feedback loops

Excerpt shown behind the slide (C. Aggarwal 2015, Chapter 1, An Introduction to Data Mining, p. 4). The page is cropped, so the excerpt starts mid-sentence.

[…] possible to directly use a standard data mining problem, such as the four “superproblems” discussed earlier, for the application at hand. However, these four problems have such wide coverage that many applications can be broken up into components that use these different building blocks. This book will provide examples of this process.
The overall data mining process is illustrated in Fig. 1.1. Note that the analytical block in Fig. 1.1 shows multiple building blocks representing the design of the solution to a particular application. This part of the algorithmic design is dependent on the skill of the analyst and often uses one or more of the four major problems as a building block. This is, of course, not always the case, but it is frequent enough to merit special treatment of these four problems within this book. To explain the data mining process, we will use an example from a recommendation scenario.

The process : C. Aggarwal 2015

• A typical data mining application contains the following phases
1. Data collection
2. Data preprocessing → make data suitable for processing
a. Feature extraction
b. Data cleaning
c. Feature selection and transformation
3. Analytical processing and algorithms

Excerpt shown behind the slide (C. Aggarwal 2015, Chapter 1). The page is cropped, so the excerpt ends mid-sentence.

Example 1.2.1 Consider a scenario in which a retailer has Web logs corresponding to customer accesses to Web pages at his or her site. Each of these Web pages corresponds to a product, and therefore a customer access to a page may often be indicative of interest in that particular product. The retailer also stores demographic profiles for the different customers. The retailer wants to make targeted product recommendations to customers using the customer demographics and buying behavior.
Sample Solution Pipeline In this case, the first step for the analyst is to collect the relevant data from two different sources. The first source is the set of Web logs at the site. The second is the demographic information within the retailer database that were collected during Web registration of the customer. Unfortunately, these data sets are in a very different format and cannot easily be used together for processing. For example, consider a sample log entry of the following form:
98.206.207.157 - - [31/Jul/2013:18:09:38 -0700] "GET /productA.htm
HTTP/1.1" 200 328177 "-" "Mozilla/5.0 (Mac OS X) AppleWebKit/536.26
(KHTML, like Gecko) Version/6.0 Mobile/10B329 Safari/8536.25"
"retailer.net"
The log may contain hundreds of thousands of such entries. Here, a customer at IP address 98.206.207.157 has accessed productA.htm. The customer from the IP address can be identified using the previous login information, by using cookies, or by the IP address itself, but this may be a noisy process and may not always yield accurate results. The analyst would need to design algorithms for deciding how to filter the different log entries and use only those which provide accurate results as a part of the cleaning and extraction process. Furthermore, the raw log contains a lot of additional information that is not necessarily […]

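
The feature extraction and cleaning phases can be illustrated on the sample log entry quoted in the example. The sketch below is one possible approach, not the book's method: the regular expression, the rule "keep only status 200" and the function name `extract_features` are assumptions made for illustration.

```python
import re
from datetime import datetime

# The sample entry from the example, joined into a single line.
LOG_LINE = (
    '98.206.207.157 - - [31/Jul/2013:18:09:38 -0700] "GET /productA.htm '
    'HTTP/1.1" 200 328177 "-" "Mozilla/5.0 (Mac OS X) AppleWebKit/536.26 '
    '(KHTML, like Gecko) Version/6.0 Mobile/10B329 Safari/8536.25" '
    '"retailer.net"'
)

# Matches the leading fields of the entry: IP, timestamp, request, status, size.
PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def extract_features(line):
    """Return the fields of interest, or None if the entry should be dropped."""
    m = PATTERN.match(line)
    if m is None:
        return None              # cleaning: unparseable entry
    if m["status"] != "200":
        return None              # cleaning (assumed rule): keep successful requests only
    return {
        "ip": m["ip"],
        "timestamp": datetime.strptime(m["time"], "%d/%b/%Y:%H:%M:%S %z"),
        "product": m["path"].lstrip("/").rsplit(".", 1)[0],
    }

features = extract_features(LOG_LINE)
print(features)
```

The resulting records (IP address, timestamp, product) are in a form that can be joined with the demographic data, which is the purpose of the preprocessing phase.
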
The process : CRISP-DM
• CRISP-DM
• CRoss-Industry Standard Process For Data Mining
• “An industry-proven way to guide your data mining efforts.”

• Conceived in 1996 by a consortium of 5 companies and first published in 1999


• Reported as the leading methodology used by Data Miners in many polls
• ASUM-DM, developed by IBM, is an extended and refined version of CRISP-DM for Data Mining/Predictive Analytics projects
• Analytics Solutions Unified Method
• http://i2t.icesi.edu.co/ASUM-DM_External/index.htm

The process : CRISP-DM

• CRISP-DM : how to conduct a data mining project

• As a methodology, CRISP-DM includes descriptions of the typical phases of a project, the tasks involved with each phase, and an explanation of the relationships between these tasks.

• As a process model, CRISP-DM provides an overview of the data mining project life cycle.

CRISP-DM : data mining life cycle
• Business understanding
• Understanding the project objectives and requirements from a
business perspective.
• Converting this knowledge into a data mining problem
definition and a preliminary plan designed to achieve the
objectives.

• Data understanding
• Starts with an initial data collection.
• Get familiar with the data.

• Identify data quality problems.
• Discover first insights into the data.

CRISP-DM : data mining life cycle


• Data preparation
• Covers all activities to construct the final dataset from the initial raw data → data that will be fed into the modeling tools.
• Includes tasks such as data selection, transformation and cleaning.

• Modeling
• Various modeling techniques are selected and applied, and their parameters are calibrated to optimal values.
• Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements on the form of data.
• Stepping back to the data preparation phase is often necessary.

CRISP-DM : data mining life cycle
• Evaluation
• At this stage you have built a model (or models) that appears to have high quality from a data analysis perspective.
• Evaluate the model and review the steps executed to be certain it achieves the business objectives.
• Deployment
• Creation of the model is not the end of the project. The knowledge gained will need to be organized and presented in a way that the customer can use it.
• It often involves applying live models within an organization’s decision-making processes.
• The deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process across the enterprise.
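
The life cycle can be sketched as a small table of allowed moves between phases. Only the step back from modeling to data preparation is stated in the slides; the other backward moves below follow the arrows of the commonly reproduced CRISP-DM diagram and should be read as an assumption of this sketch, as should the names `NEXT` and `is_valid_path`.

```python
# Allowed moves between CRISP-DM phases (forward moves first, then step-backs).
NEXT = {
    "Business understanding": ["Data understanding"],
    "Data understanding": ["Data preparation", "Business understanding"],
    "Data preparation": ["Modeling"],
    "Modeling": ["Evaluation", "Data preparation"],
    "Evaluation": ["Deployment", "Business understanding"],
    "Deployment": [],
}

def is_valid_path(path):
    """Check that every consecutive pair of phases is an allowed move."""
    return all(b in NEXT[a] for a, b in zip(path, path[1:]))

straight = list(NEXT)  # the six phases in their nominal order
with_rework = ["Data preparation", "Modeling", "Data preparation",
               "Modeling", "Evaluation"]

print(is_valid_path(straight))      # nominal pass through all phases
print(is_valid_path(with_rework))   # stepping back to data preparation
print(is_valid_path(["Business understanding", "Modeling"]))  # skips phases
```
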

The Data Mining Process

Basic Data Types


Multidimensional Data

• Definition :
• A multidimensional dataset $D$ is a set of $n$ records $x_1, x_2, \dots, x_n$, such that each record $x_i$ contains a set of $d$ features denoted as $(x_{i1}, x_{i2}, \dots, x_{id})$.
• The dataset is described by $d$ attributes $A_1, A_2, \dots, A_d$.
• We denote by $\mathcal{X}$ the space of all possible records.
• Records are also called : data points, instances, examples, transactions, entities, tuples, objects, or feature vectors.
• Attributes are also called : fields, dimensions, or features.
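
A minimal sketch of this definition in code: a dataset of n records, each holding d features, described by d attribute names. The attribute names and values are hypothetical, and the helpers `feature` and `column` exist only for this illustration.

```python
# d attributes describing the dataset (hypothetical names).
ATTRIBUTES = ("age", "income", "city")

# n records; each record is a tuple of d features.
DATASET = [
    (25, 3200.0, "Meknes"),
    (41, 5100.0, "Rabat"),
    (33, 4000.0, "Fes"),
]

n = len(DATASET)      # number of records
d = len(ATTRIBUTES)   # number of features per record

# Every record must contain exactly d features.
assert all(len(record) == d for record in DATASET)

def feature(i, j):
    """Feature j of record i, using the 1-based indices of the definition."""
    return DATASET[i - 1][j - 1]

def column(name):
    """All values taken by one attribute across the n records."""
    j = ATTRIBUTES.index(name)
    return [record[j] for record in DATASET]

print(n, d, feature(2, 3), column("age"))
```
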

Multidimensional Data
• Types of attributes
• Qualitative (categorical)
  • Nominal : product names; brands / categories; countries / cities names; colors
  • Ordinal : state (high, low / good, bad); feeling (happy, sad, neutral); category (gold, silver, bronze); grade A, B, C, D
• Binary
• Quantitative (numeric)
  • Ratio : weight / length; price; count; income; annual sales
  • Interval : location; temperature; IQ score

Types of attributes
• Defines the levels of measurement
• Possible attribute types :
• Qualitative :
• Nominal
• Ordinal
• Quantitative :
• Numeric / Interval


Nominal quantities
• No relation is implied among nominal values
• → no ordering or distance measure
• Only equality tests can be performed
• Values are distinct symbols
• Values themselves serve only as labels or names
• Examples:
• Attribute : country, values : Morocco, Algeria, Tunisia, …
• Attribute : color, values : red, green, blue
• Attribute : gender, values : male, female
• Attribute : valid, values : yes, no
• Brands, categories, jobs, …

Ordinal quantities
• Impose order on values
• But: no distance between values defined
• Examples:
• Attribute : age, Values : child < young < adult
• Attribute : temperature Values : hot > mild > cool
• Attribute : grade Values : A, B, C, D
• Attribute : emotion Values : sad, neutral, happy
• Note: addition and subtraction don’t make sense


Numeric Quantities (Interval)


• Interval quantities are not only ordered but measured in fixed and
equal units
• Examples:
• Attribute temperature expressed in degrees Celsius
• Attribute age expressed in years
• Count, weight, height, …
• Difference of two values makes sense

Attribute types in practice

• Most schemes accommodate just two levels of measurement:
• Nominal and Numeric (Ordinal)
• Nominal attributes are also called categorical, enumerated, or discrete
• Special case: binary attributes (Yes/No, 0/1, True/False)
• Only equality tests possible
• Numeric attributes
• Values are ordered
• Example : real, integer, …

Attribute types in practice


• Qualitative vs Quantitative ?
• Sometimes transformations are needed!
• ...
• Why do Data Mining algorithms need to know about attribute types?
• Color > blue : doesn’t make sense!
• Age > 30 : does

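
The kind of transformation alluded to here can be sketched as follows: a nominal attribute is encoded so that only equality survives, while an ordinal attribute is encoded by its rank so that comparisons remain meaningful. The levels are taken from the examples in the slides; the helper names `one_hot` and `rank` are hypothetical, and one-hot encoding is only one common choice.

```python
COLORS = ["red", "green", "blue"]          # nominal: labels only, no order
AGE_GROUPS = ["child", "young", "adult"]   # ordinal: child < young < adult

def one_hot(value, categories):
    """Encode a nominal value so that only equality is preserved."""
    return [1 if value == c else 0 for c in categories]

def rank(value, ordered_levels):
    """Encode an ordinal value by its position; only the order is meaningful."""
    return ordered_levels.index(value)

blue = one_hot("blue", COLORS)                                  # no '>' implied
is_older = rank("adult", AGE_GROUPS) > rank("child", AGE_GROUPS)
age_over_30 = 42 > 30                                           # numeric: direct comparison

print(blue, is_older, age_over_30)
```

Encoding colors as 0, 1, 2 instead would silently introduce an order (blue > red) that the attribute does not have, which is why the algorithm needs to know the attribute type.
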
Multidimensional Data
• Types of attributes
• N – Nominal (labels)
  – Fruits: apples, oranges, …
  • Operations : = , !=
• O – Ordinal (ordered)
  – Quality of meat: Grade A, AA, AAA
  • Operations : = , != , < , > , <= , >=
• Q – Interval (location of zero arbitrary)
  – Dates: Jan 5, 2012; location: (LAT 47 LONG 122)
  – Like a geometric point. Cannot compare directly.
  – Only differences (i.e. intervals) may be compared.
  • Operations : = , != , < , > , <= , >= , – , + , mean
  • Can measure distances
• Q – Ratio (zero fixed)
  – Physical measurement: length, mass…
  – Counts and amounts
  – Like a geometric vector, origin is meaningful
  • Operations : = , != , < , > , <= , >= , – , + , / , mean
  • Can measure ratios or proportions

Source figure: “Classification of data types: Nominal, ordinal and quantitative” [S. S. Stevens, On the theory of scales of measurements, 1946], Cecilia Aragon, HCDE, UW, May 2013.
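
The operations listed for each level of measurement can be written down as a lookup table, so that a program can refuse an operation that is not meaningful for an attribute. This is a minimal sketch; the dictionary `ALLOWED` and the function `is_meaningful` are hypothetical names, and the operation sets simply mirror the slide.

```python
# Each level keeps the operations of the previous one and adds its own.
NOMINAL = {"==", "!="}
ORDINAL = NOMINAL | {"<", ">", "<=", ">="}
INTERVAL = ORDINAL | {"-", "+", "mean"}
RATIO = INTERVAL | {"/"}

ALLOWED = {
    "nominal": NOMINAL,
    "ordinal": ORDINAL,
    "interval": INTERVAL,
    "ratio": RATIO,
}

def is_meaningful(level, operation):
    """True if the operation is meaningful at this level of measurement."""
    return operation in ALLOWED[level]

print(is_meaningful("nominal", "<"))    # ordering fruits: not meaningful
print(is_meaningful("interval", "-"))   # difference between two dates: meaningful
print(is_meaningful("interval", "/"))   # ratio of two dates: not meaningful
print(is_meaningful("ratio", "/"))      # ratio of two lengths: meaningful
```
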

Example
• Titanic Dataset

Quiz
• Give an appropriate type for each of the following attributes :
• Student ID
• Department (GIP, GM, GC, …)
• Annual salary
• Marital status
• Number of children
• Rating (bad, medium, good) or stars


The Data Mining Process

Data Mining Tasks


Major Data Mining Tasks
• Classification: predicting an item class
• Clustering: finding clusters in data
• Associations: e.g. A & B & C occur frequently
• Visualization: to facilitate human discovery
• Summarization: describing a group
• Deviation Detection: finding changes
• Estimation: predicting a continuous value
• Link Analysis: finding relationships

• …

Supervised vs. Unsupervised


• Learning principle : “using a set of observations to uncover an underlying process”
• Training dataset

• Supervised learning :
• Right answers are given in a training dataset
• All input data is labeled, and the algorithms learn to predict the output from the input data.

• Unsupervised learning:
• Input dataset is not labeled
• All input data is unlabeled, and the algorithms learn the inherent structure of the input data.

Supervised vs. Unsupervised

• Supervised learning : predictive data mining
• Classification
• Regression
• …

• Unsupervised learning: descriptive data mining
• Clustering
• Associations
• …
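
The contrast can be made concrete on one-dimensional toy data. The sketch below uses a nearest-neighbor rule as the supervised example and a two-cluster k-means as the unsupervised one; both algorithms, the data and the function names are choices made for this illustration, not methods prescribed by the slides.

```python
def nearest_neighbor_predict(train, x):
    """Supervised: train is a list of (value, label); return the label of the closest value."""
    _, label = min(train, key=lambda pair: abs(pair[0] - x))
    return label

def two_means(values, iterations=10):
    """Unsupervised: split unlabeled values into two clusters (k-means with k = 2).

    Assumes the values contain at least two distinct numbers.
    """
    c1, c2 = min(values), max(values)   # simple deterministic initialisation
    for _ in range(iterations):
        g1 = [v for v in values if abs(v - c1) <= abs(v - c2)]
        g2 = [v for v in values if abs(v - c1) > abs(v - c2)]
        c1, c2 = sum(g1) / len(g1), sum(g2) / len(g2)
    return sorted(g1), sorted(g2)

# Supervised: the right answers (labels) are given with the training data.
labeled = [(1.0, "low"), (1.5, "low"), (8.0, "high"), (9.0, "high")]
prediction = nearest_neighbor_predict(labeled, 2.0)

# Unsupervised: no labels; the algorithm only finds structure in the values.
unlabeled = [1.0, 1.5, 2.0, 8.0, 9.0, 8.5]
clusters = two_means(unlabeled)

print(prediction, clusters)
```

Note that the clusters come back without names: deciding that one group means "low" and the other "high" is an interpretation step left to the analyst.
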

Quiz
Of the following examples, which would you address using an
unsupervised learning algorithm?

1. Given email labeled as spam/not spam, learn a spam filter.
2. Given a set of news articles found on the web, group them into sets of articles about the same story.
3. Given a database of customer data, automatically discover market segments and group customers into different market segments.
4. Given a dataset of patients diagnosed as either having diabetes or not, learn to classify new patients as having diabetes or not.

Quiz

• Read the comic about an engineer’s interview. Does he use supervised or unsupervised learning?
