Unit 2 Data Science Process (P)

The document discusses the data science process and the CRISP-DM framework. It describes the typical stages in a data science project including business understanding, data understanding, data preparation, modeling, evaluation, and deployment. It provides examples of each stage and key tasks involved.

Uploaded by

toan

Unit 2. Data science process: business problems and data science solutions

Assoc. Prof Nguyen Manh Tuan


Opening

• An important principle of data science is that data mining is a process with fairly well-understood stages, or a set of fairly well-defined subtasks.
  - Some stages involve the application of IT, such as the automated discovery and evaluation of patterns from data, while others mostly require an analyst's creativity, business knowledge, and common sense.
• Each data-driven business decision-making problem is unique, comprising its own combination of goals, desires, constraints, and even personalities.
• The problem can be decomposed into subtasks, and the solutions to the subtasks can then be composed to solve the overall problem. Some of the subtasks are unique to the particular business problem, but others are common data mining tasks.
• Despite the large number of specific data mining algorithms developed over the years, there are only a handful of fundamentally different types of tasks these algorithms address.

2/20/2024 internal use


CRISP-DM process
Cross Industry Standard Process for Data Mining

- End-to-end, multi-step, iterative process
- Going back and forth, and at times back to the first step, to redefine the data science problem statement


Process

Business Understanding
• It is vital to understand the business problem to be solved (NOT just to build a prediction model!), and then to design a data analytics solution for it.
• This is a part of the craft where the analysts' creativity plays a large role.
• The design team should think carefully about the use scenario.

Data Understanding
• The data comprise the available raw material from which the solution will be built.
• Estimate the costs and benefits of each data source and decide whether further investment is merited.
• Understand the different kinds of data contained in these sources.


Process

Data Preparation
• Often proceeds along with data understanding.
• Includes all activities required to convert the disparate data sources into a well-formed analytics base table.
• Examples:
  - converting data to tabular format.
  - removing or inferring missing values.
  - converting data to different types.
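
The three example activities above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the borrower records, field names, and values are invented, and filling a gap with the column mean is just one simple way to infer a missing value.

```python
from statistics import mean

# Hypothetical raw records from disparate sources (illustrative values only).
raw_records = [
    {"id": "B1", "credit_score": "700", "income": "52000"},
    {"id": "B2", "credit_score": "", "income": "61000"},   # missing score
    {"id": "B3", "credit_score": "640", "income": None},   # missing income
    {"id": "B4", "credit_score": "580"},                   # income field absent
]

COLUMNS = ["id", "credit_score", "income"]
NUMERIC = ["credit_score", "income"]


def to_number(value):
    """Convert a raw value to float; treat empty or absent values as missing."""
    if value is None or str(value).strip() == "":
        return None
    return float(value)


def build_table(records):
    """Return a well-formed table: one row per record, one typed value per column."""
    # 1. Tabular format: every row gets every column.
    rows = [{col: rec.get(col) for col in COLUMNS} for rec in records]
    # 2. Type conversion: numeric columns become floats (or None if missing).
    for row in rows:
        for col in NUMERIC:
            row[col] = to_number(row[col])
    # 3. Missing values: infer them from the column mean (one simple option).
    for col in NUMERIC:
        known = [row[col] for row in rows if row[col] is not None]
        fill = mean(known)
        for row in rows:
            if row[col] is None:
                row[col] = fill
    return rows


table = build_table(raw_records)
for row in table:
    print(row)
```

Removing incomplete rows instead of inferring values would be the other option named above; which one is appropriate depends on how much data can be spared.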

Modeling
• Different data mining tasks are used to build relevant predictive models.
• The output of modeling is some sort of model or pattern capturing regularities in the data.


Process

Evaluation
• Assess the data mining results rigorously to gain confidence that they are valid and reliable before moving on.
• Usually, a data mining solution is only a piece of the larger solution, and it needs to be evaluated as such.

Deployment
• The data mining results are put into real use in order to realize some return on investment.
• The clearest cases of deployment involve implementing a predictive model in some information system or business process.


Process
1. Prior Knowledge: Business Understanding; Data Understanding
2. Data Preparation: Prepare Data
3. Modeling: Building Model using Algorithms (Training Data); Applying Model and Performance Evaluation (Test Data)
4. Application: Deployment
5. Posterior Knowledge: Knowledge and Actions

Types of modeling:
- Predictive modeling
- Descriptive/explanatory modeling


1. Prior Knowledge
Gaining information on:
- Objective of the problem
- Subject area of the problem and contextual information
- Data

Consumer loan business (case)
- Interest rate (vs principal)
- Federal funds rate (central/national bank)
- Borrower's credit score/income level/initial down payment amount/current assets/liabilities
- Lender's reward (interest) vs risk (default on the loan)
  • An individual: default status is Boolean
  • Group of borrowers: default rate is a continuous numeric variable indicating the percentage of borrowers who default

If the interest rate of past borrowers with a range of credit scores is known, can the interest rate for a new borrower be predicted?
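
One way to make this question concrete is a straight-line fit of interest rate against credit score. The sketch below is an assumption, not the method prescribed by the slides: it uses ordinary least squares on four invented (credit score, interest rate) pairs, and the names `fit_line` and `predict` are hypothetical helpers.

```python
# Hypothetical past borrowers: (credit score, interest rate in %).
# These values are invented for illustration only.
past_borrowers = [(500, 9.0), (600, 8.0), (700, 7.0), (800, 6.0)]


def fit_line(points):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(model, credit_score):
    """Apply the fitted line to a new borrower's credit score."""
    intercept, slope = model
    return intercept + slope * credit_score


model = fit_line(past_borrowers)
# Predicted interest rate (%) for a new borrower with a credit score of 650.
print(round(predict(model, 650), 2))
```

Real borrower data would not fall on a perfect line, and attributes such as income or liabilities would normally enter the model as well.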



1. Prior Knowledge

Correlation Analysis
- Two factors are correlated when values of X have some predictive power on the value of Y.
- The correlation coefficient of X and Y measures the degree to which Y is a function of X (and vice versa).
- Correlation ranges from -1 (anti-correlated) to 1 (fully correlated) through 0 (uncorrelated).
  • SAT scores and freshman GPA (r=0.47)
  • Income and coronary disease (r=-0.717)
  • Smoking and mortality rate (r=0.716)
  • Video games and violent behaviour (r=0.19)

Causation versus Correlation
Correlation does not mean causation. The number of police active in a precinct correlates strongly with the local crime rate, but the police do not cause the crime.
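
The range from -1 to 1 can be checked with a small computation. The sketch below assumes the coefficient meant here is Pearson's r, which measures the strength of a linear relationship; the three toy series are invented to hit the two extremes and the uncorrelated case.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Undefined (division by zero) if either sequence is constant.
    return cov / sqrt(var_x * var_y)


x = [1, 2, 3, 4, 5]
print(pearson_r(x, [2, 4, 6, 8, 10]))   # fully correlated
print(pearson_r(x, [10, 8, 6, 4, 2]))   # anti-correlated
print(pearson_r(x, [2, 1, 3, 1, 2]))    # uncorrelated
```

A value near 0 only rules out a linear relationship, and a value near 1 or -1 says nothing about which factor, if either, causes the other.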

1. Prior Knowledge

• A dataset (example set; sometimes data frame)
• A data point (example, record, object)
• An attribute (feature, dimension, variable, field, predictor/antecedent, input)
• A label (class label, target, response, prediction/consequence, output/outcome)
• Identifiers: used for locating or providing context for individual records; excluded from modeling.

Attribute types
- numeric/continuous
- categorical/nominal
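
To tie the vocabulary together, the sketch below represents a tiny example set as a list of Python dictionaries and separates each data point into identifier, attributes, and label. The field names and values are hypothetical, and the numeric/categorical check is a simplification based on the Python type of each value.

```python
# Hypothetical example set: each dict is one data point (record).
dataset = [
    {"borrower_id": "B1", "credit_score": 700, "home_owner": "yes", "interest_rate": 7.0},
    {"borrower_id": "B2", "credit_score": 580, "home_owner": "no", "interest_rate": 8.2},
]

IDENTIFIER = "borrower_id"   # locates a record; excluded from modeling
LABEL = "interest_rate"      # the target to be predicted


def split_record(record):
    """Separate one data point into (identifier, attributes, label)."""
    attributes = {
        key: value
        for key, value in record.items()
        if key not in (IDENTIFIER, LABEL)
    }
    return record[IDENTIFIER], attributes, record[LABEL]


def attribute_type(value):
    """Classify an attribute value as numeric or categorical."""
    # bool is a subclass of int in Python, so handle it explicitly.
    if isinstance(value, bool):
        return "categorical"
    return "numeric" if isinstance(value, (int, float)) else "categorical"


for record in dataset:
    identifier, attributes, label = split_record(record)
    types = {name: attribute_type(value) for name, value in attributes.items()}
    print(identifier, attributes, label, types)
```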



2. Data Preparation
- Data Exploration
  - descriptive statistics
  - visualization of data
- Data quality
- Handling missing values
- Data type conversion
- Transformation
- Outliers
- Feature selection
- Sampling
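
Three of the items above (descriptive statistics, outliers, transformation) can be illustrated with the Python standard library. The income values are invented, and the specific techniques are assumptions chosen for brevity rather than the slides' prescribed methods: the 1.5 × IQR rule for flagging outliers and min-max scaling as the transformation.

```python
from statistics import mean, median, quantiles, stdev

# Hypothetical annual incomes in thousands (illustrative values only).
incomes = [42, 45, 47, 50, 52, 55, 58, 61, 400]


def describe(values):
    """Basic descriptive statistics for one numeric attribute."""
    return {
        "mean": mean(values),
        "median": median(values),
        "stdev": stdev(values),
        "min": min(values),
        "max": max(values),
    }


def iqr_outliers(values, k=1.5):
    """Flag values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def min_max_scale(values):
    """Transform values linearly onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


print(describe(incomes))
print(iqr_outliers(incomes))
print(min_max_scale(incomes))
```

Note how the single extreme value pulls the mean far above the median and compresses every other scaled value toward 0, which is why outliers are usually examined before a transformation is applied.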



3. Modeling

Training Data → Build model
Test Data → Evaluation
→ Final Model

Data → Data Mining → Model
"Training" data have all values specified.

New data item → Model → prediction
The new data item has some value unknown (e.g., will she leave?).
Splitting training and test data sets (rule of thumb: 2/3 for training; 1/3 for test)

[Tables: Training Data; Test Data]
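
The 2/3 to 1/3 rule of thumb can be sketched as a shuffled split. The function name `train_test_split`, the fixed seed, and the twelve placeholder records are assumptions made for the example; shuffling before cutting is a common precaution so that any ordering in the source data does not leak into the split.

```python
import random


def train_test_split(records, train_fraction=2 / 3, seed=0):
    """Shuffle a copy of the records and cut it into training and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Twelve placeholder records standing in for rows of a dataset.
records = list(range(1, 13))
train, test = train_test_split(records)
print(len(train), len(test))
```

Every record ends up in exactly one of the two sets, so the model is never evaluated on a data point it was trained on.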

Evaluation of test dataset
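
Evaluating on the test dataset means comparing the model's predictions with the known labels of records it has not seen. The slides do not name a metric, so the sketch below assumes two common ones for a numeric label, mean absolute error and root mean squared error; the actual and predicted interest rates are invented.

```python
from math import sqrt


def mean_absolute_error(actual, predicted):
    """Average size of the prediction errors, ignoring their sign."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Like MAE, but squaring penalizes large errors more heavily."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Hypothetical test-set labels and model predictions (interest rate in %).
actual = [7.0, 8.0, 6.5, 9.0]
predicted = [7.5, 7.5, 6.5, 8.0]

print(mean_absolute_error(actual, predicted))
print(root_mean_squared_error(actual, predicted))
```

For a Boolean label such as default status, a classification measure such as accuracy would take the place of these error measures.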

4. Application

Deployment: the stage at which the model becomes production ready or live.
• The results of the data science process have to be assimilated into the business process (usually in business apps).
• Product readiness
• Technical integration
• Model response time
• Remodeling
• Assimilation


5. Posterior Knowledge



THE END
