Unit 2 Data Science Process (P)
Business Understanding
It is vital to understand the business problem to be solved (NOT to jump straight
to building a prediction model!), and then to design a data analytics solution for it.
This is a part of the craft where the analysts’ creativity plays a large role.
The design team should think carefully about the use scenario.
Data Understanding
The data comprise the available raw material from which the solution will
be built.
Estimating the costs and benefits of each data source and deciding
whether further investment is merited.
Understanding the different kinds of data contained in these sources
Data Preparation
Often proceeds along with data understanding.
Includes all activities required to convert the disparate data sources into a well-formed
analytics base table.
Ex:
converting data to tabular format.
removing or inferring missing values.
converting data to different types.
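The three example activities above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the `RAW` records, the column names, and the `prepare` helper are made up for the example, and filling gaps with the column mean is just one simple way to infer missing values.

```python
import csv
import io
from statistics import mean

# Hypothetical raw export (made-up records); empty fields are missing values.
RAW = """customer_id,age,plan,monthly_spend
1,34,basic,20.5
2,,premium,55.0
3,45,basic,
4,29,premium,60.0
"""

NUMERIC_COLUMNS = ("age", "monthly_spend")


def prepare(raw_text):
    # 1. Convert the raw text to tabular form: one dict per row.
    rows = list(csv.DictReader(io.StringIO(raw_text)))

    # 2. Convert columns from strings to proper types (None marks a gap).
    for row in rows:
        row["customer_id"] = int(row["customer_id"])
        for col in NUMERIC_COLUMNS:
            row[col] = float(row[col]) if row[col] != "" else None

    # 3. Infer missing values using the mean of the observed values.
    for col in NUMERIC_COLUMNS:
        fill = mean(r[col] for r in rows if r[col] is not None)
        for r in rows:
            if r[col] is None:
                r[col] = fill
    return rows


table = prepare(RAW)
for row in table:
    print(row)
```

Removing the incomplete rows instead of filling them would be the other option the list mentions; which one is appropriate depends on how much data is missing and why.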
Modeling
Different data mining tasks are used to build relevant predictive models
The output of modeling is some sort of model or pattern that captures regularities in the data.
Evaluation
Assess the data mining results rigorously to gain confidence that they are valid
and reliable before moving on.
Usually, a data mining solution is only a piece of the larger solution, and it needs to
be evaluated as such.
Deployment
The data mining results are put into real use in order to realize some return on investment.
The clearest cases of deployment involve implementing a predictive model in some
information system or business process.
Modeling approaches:
- Predictive modeling
- Descriptive/explanatory modeling
[Figure: process stages mapped to numbered steps: Prepare Data → 2. Data Preparation; Deployment → 4. Application]
Correlation Analysis
- Two factors are correlated when the value of x has some predictive power on
the value of y.
- The correlation coefficient of X and Y measures the degree to which Y is a
function of X (and vice versa).
- Correlation ranges from -1 (anti-correlated) through 0 (uncorrelated) to 1
(fully correlated).
• SAT scores and freshman GPA (r=0.47)
• Income and coronary disease (r=-0.717)
• Smoking and mortality rate (r=0.716)
• Video games and violent behaviour (r=0.19)
Causation versus Correlation
Correlation does not mean causation.
The number of police active in a precinct correlates strongly with the local
crime rate, but the police do not cause the crime.
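To make the -1 to 1 range concrete, here is a minimal sketch that computes a correlation coefficient by hand. It assumes the Pearson coefficient, which measures linear association; the slides do not say which coefficient is meant. The function name `pearson_r` and the sample numbers are made up for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (>= 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (numerator of r).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of the sums of squared deviations (denominator of r).
    sx = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sy = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)


xs = [1, 2, 3, 4, 5]
r_up = pearson_r(xs, [2 * x + 1 for x in xs])   # y rises with x: r close to 1
r_down = pearson_r(xs, [-3 * x for x in xs])    # y falls with x: r close to -1
r_none = pearson_r(xs, [1, -1, 0, -1, 1])       # no linear trend: r is 0
print(r_up, r_down, r_none)
```

A value of r near 1 or -1 says only that one variable helps predict the other. It says nothing about which one, if either, is the cause.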
Attribute types
- numeric/ continuous
- categorical/ nominal
internal use
2/20/2024
3. Modeling
[Figure: Data → Data Mining → Model; New data item → Model → Prediction]
New data item has some value unknown (e.g., will she leave?)
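The flow from data to model to prediction can be illustrated with a deliberately tiny example. This is a sketch under stated assumptions, not the method used in the course: the "model" is a single learned threshold on one attribute, and `TRAINING`, `fit_threshold`, and `predict` are hypothetical names with made-up data.

```python
# Made-up historical data: (months as a customer, did the customer leave?)
TRAINING = [
    (1, True), (2, True), (3, True), (4, True),
    (5, False), (8, False), (10, False), (12, False),
]


def fit_threshold(examples):
    """Data mining step: learn a rule 'leaves if months < t'.

    Tries the midpoints between consecutive distinct values and keeps the
    threshold with the fewest errors on the training examples.
    Needs at least two distinct attribute values.
    """
    values = sorted({months for months, _ in examples})
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(t):
        return sum((months < t) != left for months, left in examples)

    return min(candidates, key=errors)


def predict(threshold, months):
    """Apply the learned model to a new data item."""
    return months < threshold


threshold = fit_threshold(TRAINING)   # the "model" is just this number
print(threshold)
print(predict(threshold, 2))          # new customer with 2 months of history
print(predict(threshold, 9))          # new customer with 9 months of history
```

Here the learned threshold captures a regularity in the historical data, and the prediction fills in the unknown value for a new item.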
3. Modeling
Split the data into training and test sets (rule of thumb: 2/3 for training, 1/3
for testing).
Training Data
Test Data
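A minimal sketch of the 2/3 to 1/3 rule of thumb, using only the standard library. The helper name `train_test_split`, the fixed `seed`, and the 30 placeholder records are assumptions made for the example.

```python
import random


def train_test_split(items, train_fraction=2 / 3, seed=0):
    """Shuffle a copy of the items and cut it into training and test parts."""
    shuffled = list(items)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


records = list(range(30))          # stand-in for 30 data records
train, test = train_test_split(records)
print(len(train), len(test))
```

Shuffling before the cut matters when the records are stored in some order (for example by date or by customer), since an unshuffled split could give the test set a different mix of cases from the training set.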
4. Application
Posterior knowledge