02-Data_Mining_The_Data_Mining_Process
EL Moukhtar ZEMMOURI
ENSAM-Meknès
2023-2024
The process : U. Fayyad et al. 1996

Excerpt from U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, Communications of the ACM, November 1996 / Vol. 39, No. 11:

… multidimensional data analysis, which is superior to SQL (a standard data manipulation language) in computing summaries and breakdowns along many dimensions. While current OLAP tools target interactive data analysis, we expect they will also include more automated discovery components in the near future.
Fields concerned with inferring models from data—including statistical pattern recognition, applied statistics, machine learning, and neural networks—were the impetus for much early KDD work. KDD largely relies on methods from these fields to find patterns from data in the data mining step of the KDD process. A natural question is: How is KDD different from these other fields? KDD focuses on the overall process of knowledge discovery from data, including how the data is stored and accessed, how algorithms can be scaled to massive datasets and still run efficiently, how results can be interpreted and visualized, and how the overall human-machine interaction can be modeled and supported. …
… long enough in any dataset (even randomly generated data), one can find patterns that appear to be statistically significant but in fact are not. This issue is of fundamental importance to KDD. There has been substantial progress in understanding such issues in statistics in recent years, much directly relevant to KDD. Thus, data mining is a legitimate activity as long as one understands how to do it correctly. KDD can also be viewed as encompassing a broader view of modeling than statistics, aiming to provide tools to automate (to the degree possible) the entire process of data analysis, including the statistician's art of hypothesis selection.

The KDD Process
Here we present our (necessarily subjective) perspective of a unifying process-centric framework for KDD. The goal is to provide an overview of the variety of activities …

Footnote from the excerpt: See Providing OLAP to User Analysts: An IT Mandate by E.F. Codd and Associates (1993).

• KDD process steps :
• Preprocessing :
1. Learning the application domain.
2. Creating a target dataset.
…
• Postprocessing :
8. Interpretation.
9. Using discovered knowledge.
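The excerpt above makes a concrete statistical point: if one searches long enough, even purely random data yields patterns that look significant. Below is a minimal Python sketch of this multiple-comparisons effect (using NumPy and SciPy; the data sizes and the 0.05 threshold are illustrative assumptions, not from the paper):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Purely random data: 1000 candidate "features" and one random target,
# 50 observations each. No real relationship exists by construction.
n_obs, n_features = 50, 1000
X = rng.normal(size=(n_obs, n_features))
y = rng.normal(size=n_obs)

# Test every feature against the target and keep the p-values.
p_values = np.array([stats.pearsonr(X[:, j], y)[1] for j in range(n_features)])

print("smallest p-value:", round(p_values.min(), 4))
print("features 'significant' at the 0.05 level:", int((p_values < 0.05).sum()))
# With 1000 tests, roughly 50 features look "significant" purely by chance.
# This is the pitfall the KDD process must guard against, e.g. via holdout
# validation or multiple-testing corrections.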
The process : C. Aggarwal 2015

[Figure: the overall data mining process (Aggarwal, Fig. 1.1):
DATA COLLECTION → PREPROCESSING (feature extraction, cleaning and integration)
→ ANALYTICAL PROCESSING (building block 1 → building block 2) → OUTPUT FOR ANALYST,
plus FEEDBACK (OPTIONAL) loops between the stages.]
From C. Aggarwal, Data Mining: The Textbook (2015):

… possible to directly use a standard data mining problem, such as the four "superproblems" discussed earlier, for the application at hand. However, these four problems have such wide coverage that many applications can be broken up into components that use these different building blocks. This book will provide examples of this process.
The overall data mining process is illustrated in Fig. 1.1. Note that the analytical block in Fig. 1.1 shows multiple building blocks representing the design of the solution to a particular application. This part of the algorithmic design is dependent on the skill of the analyst and often uses one or more of the four major problems as a building block. This is, of course, not always the case, but it is frequent enough to merit special treatment of these four problems within this book. To explain the data mining process, we will use an example from a recommendation scenario.
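To make the block structure of Fig. 1.1 concrete, here is a minimal Python sketch in which each stage is a function. The toy records, the cleaning rule, and the two analytical building blocks (access counting, then per-customer ranking) are illustrative assumptions, not code from the book.

def collect(sources):
    # Data collection: gather raw records from one or more sources.
    return [record for source in sources for record in source]

def preprocess(records):
    # Feature extraction + cleaning/integration: keep records that parsed correctly.
    return [r for r in records if r.get("product") is not None]

def building_block_1(records):
    # Analytical building block 1: count accesses per (customer, product).
    counts = {}
    for r in records:
        key = (r["customer"], r["product"])
        counts[key] = counts.get(key, 0) + 1
    return counts

def building_block_2(counts):
    # Analytical building block 2: rank products per customer by access count.
    ranked = {}
    for (customer, product), n in counts.items():
        ranked.setdefault(customer, []).append((n, product))
    return {c: [p for _, p in sorted(v, reverse=True)] for c, v in ranked.items()}

web_logs = [
    {"customer": "u1", "product": "A"},
    {"customer": "u1", "product": "A"},
    {"customer": "u1", "product": "B"},
    {"customer": "u2", "product": None},   # malformed entry, dropped in preprocessing
]
output_for_analyst = building_block_2(building_block_1(preprocess(collect([web_logs]))))
print(output_for_analyst)   # {'u1': ['A', 'B']}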
The process : C. Aggarwal 2015

• A typical data mining application contains the following phases :
1. Data collection
2. Data preprocessing → make data suitable for processing
a. Feature extraction
b. Data cleaning
c. Feature selection and transformation
3. Analytical processing and algorithms

Example 1.2.1. Consider a scenario in which a retailer has Web logs corresponding to customer accesses to Web pages at his or her site. Each of these Web pages corresponds to a product, and therefore a customer access to a page may often be indicative of interest in that particular product. The retailer also stores demographic profiles for the different customers. The retailer wants to make targeted product recommendations to customers using the customer demographics and buying behavior.

Sample Solution Pipeline. In this case, the first step for the analyst is to collect the relevant data from two different sources. The first source is the set of Web logs at the site. The second is the demographic information within the retailer database that was collected during Web registration of the customer. Unfortunately, these data sets are in a very different format and cannot easily be used together for processing. For example, consider a sample log entry of the following form:

98.206.207.157 - - [31/Jul/2013:18:09:38 -0700] "GET /productA.htm HTTP/1.1" 200 328177 "-" "Mozilla/5.0 (Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10B329 Safari/8536.25" "retailer.net"

The log may contain hundreds of thousands of such entries. Here, a customer at IP address 98.206.207.157 has accessed productA.htm. The customer from the IP address can be identified using the previous login information, by using cookies, or by the IP address itself, but this may be a noisy process and may not always yield accurate results. The analyst would need to design algorithms for deciding how to filter the different log entries and use only those which provide accurate results as a part of the cleaning and extraction process. Furthermore, the raw log contains a lot of additional information that is not necessarily …
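As a sketch of the feature-extraction and cleaning step described above, the following Python snippet pulls the client IP and the requested page out of the sample log entry and discards entries that are malformed or not successful page views. The regular expression and the choice of fields are illustrative assumptions, not the book's code.

import re

# The sample log entry from Example 1.2.1 (combined log format plus a trailing host field).
entry = ('98.206.207.157 - - [31/Jul/2013:18:09:38 -0700] "GET /productA.htm HTTP/1.1" '
         '200 328177 "-" "Mozilla/5.0 (Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) '
         'Version/6.0 Mobile/10B329 Safari/8536.25" "retailer.net"')

# Capture the fields used as features: client IP, timestamp, requested page, status, size.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<page>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)

def extract_features(line):
    # Return an (ip, page) feature pair, or None if the line is malformed
    # or not a successful page view (part of the cleaning step).
    m = LOG_PATTERN.match(line)
    if m is None or m.group("status") != "200":
        return None
    return m.group("ip"), m.group("page")

print(extract_features(entry))   # ('98.206.207.157', '/productA.htm')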
The process : CRISP-DM
• CRISP-DM
• CRoss-Industry Standard Process For Data Mining
• “An industry-proven way to guide your data mining efforts.”
• http://i2t.icesi.edu.co/ASUM-DM_External/index.htm
CRISP-DM : data mining life cycle
• Business understanding
• Understanding the project objectives and requirements from a
business perspective.
• Converting this knowledge into a data mining problem
definition and a preliminary plan designed to achieve the
objectives.
• Data understanding
• Starts with an initial data collection.
• Get familiar with the data.
• Identify data quality problems.
• Discover first insights into the data.
• Modeling
• Various modeling techniques are selected and applied, and
their parameters are calibrated to optimal values.
• Typically, there are several techniques for the same data mining
problem type. Some techniques have specific requirements on
the form of data.
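As an illustration of the Modeling phase described above (a sketch using scikit-learn on a built-in toy dataset, not part of CRISP-DM itself), one technique is selected and its parameters are calibrated by cross-validated grid search:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Candidate parameter values to calibrate for the chosen technique.
param_grid = {"max_depth": [2, 3, 5, None], "min_samples_leaf": [1, 5, 10]}

search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("cross-validated accuracy:", round(search.best_score_, 3))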
CRISP-DM : data mining life cycle
• Evaluation
• You have built a model that appears to have high quality from
a data analysis perspective!
• Evaluate the model and review the steps executed to be certain
it achieves the business objectives.
• Deployment
• Creation of the model is not the end of the project. The
knowledge gained will need to be organized and presented in a
way that the customer can use it.
• It often involves applying live models within an organization’s
decision-making processes.
• The deployment phase can be as simple as generating a report
or as complex as implementing a repeatable data mining
process across the enterprise.
Multidimensional Data
• Definition :
• Records are also called : data point, instance, example, transaction, entity, tuple, object, or feature-vector.
• Types of attributes
• Examples : product names, state (high, low / good, bad), weight / length, localization
Types of attributes
• Defines the levels of measurement
• Possible attribute types :
• Qualitative :
• Nominal
• Ordinal
• Quantitative :
• Numeric / Interval
Nominal quantities
• No relation is implied among nominal values
• → no ordering or distance measure
• Only equality tests can be performed
• Values are distinct symbols
• Values themselves serve only as labels or names
• Examples:
• Attribute : country, values : Morocco, Algeria, Tunisia, …
• Attribute : color, values : red, green, blue
• Attribute : gender, values : male, female
Attribute types in practice
• Example : real, integer, …
Multidimensional Data
• Types of attributes

Classification of data types : Nominal, ordinal and quantitative

• N – Nominal (labels)
• Fruits : apples, oranges, …
• Operations : = , !=
• O – Ordinal (ordered)
• Quality of meat : Grade A, AA, AAA
• Operations : = , != , < , > , <= , >=
• Q – Interval (location of zero arbitrary)
• Dates : Jan 5, 2012 ; location : (LAT 47, LONG 122)
• Like a geometric point. Cannot compare directly. Only differences (i.e. intervals) may be compared.
• Operations : = , != , < , > , <= , >= , - , + , mean
• Can measure distances
• Q – Ratio (zero fixed)
• Physical measurement : length, mass, …
• Counts and amounts
• Like a geometric vector, origin is meaningful
• Operations : = , != , < , > , <= , >= , - , + , / , mean
• Can measure ratios or proportions

[S. S. Stevens, On the theory of scales of measurement, 1946]
(Slide adapted from Cecilia Aragon, HCDE, UW, May 2013)
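A small pandas sketch of the four levels of measurement, showing which operations are meaningful for each. The column names and values are invented for the illustration.

import pandas as pd

df = pd.DataFrame({
    "fruit":   pd.Categorical(["apple", "orange", "apple"]),                 # nominal: only = / !=
    "grade":   pd.Categorical(["A", "AAA", "AA"],
                              categories=["A", "AA", "AAA"], ordered=True),  # ordinal: also < / >
    "date":    pd.to_datetime(["2012-01-05", "2013-07-31", "2012-03-01"]),   # interval: differences ok
    "mass_kg": [1.2, 0.3, 0.8],                                              # ratio: ratios/proportions ok
})

print(df["fruit"] == "apple")               # nominal: equality test only
print(df["grade"] > "A")                    # ordinal: ordering is defined
print(df["date"].max() - df["date"].min())  # interval: differences are meaningful
print(df["mass_kg"] / df["mass_kg"].sum())  # ratio: proportions are meaningful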
Example
• Titanic Dataset
Quiz
• Give an appropriate type for each of the following attributes :
• Student ID
• Department (GIP, GM, GC, …)
• Annual salary
• Marital status
• Number of children
• Rating (bad, medium, good) or stars
• Supervised learning :
• Right answers are given in a training dataset
• All input data is labeled, and the algorithms learn to predict the output from the input data.
• Unsupervised learning:
• Input dataset is not labeled
• All input data is unlabeled, and the algorithms learn the inherent structure from the input
data.
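A minimal scikit-learn sketch of the contrast (the toy points and the chosen models are illustrative assumptions): a classifier is trained on labeled data, while a clustering algorithm finds structure in the same points without labels.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.2], [0.9, 1.0], [1.1, 0.8],    # one group of points
              [5.0, 5.1], [5.2, 4.9], [4.8, 5.0]])   # another group of points
y = np.array([0, 0, 0, 1, 1, 1])                     # labels = the "right answers"

# Supervised: the labels y are given, and the model learns to predict them.
clf = LogisticRegression().fit(X, y)
print(clf.predict([[1.0, 1.0], [5.0, 5.0]]))         # -> [0 1]

# Unsupervised: no labels; the algorithm discovers the structure (2 clusters) itself.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)                                    # two groups, with arbitrary cluster ids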
Supervised vs. Unsupervised
• Regression
• …
• Associations
• …
Quiz
Of the following examples, which would you address using an
unsupervised learning algorithm?
Quiz