Lesson 6 Data Life Cycle Part 2

The document discusses the key phases and activities involved in model planning for data analytics projects. These include exploring the data, selecting relevant variables, identifying candidate models based on the project goals and data structure, and developing datasets for training and testing the models. Iteration between model planning and building phases is common to refine the models. The overall goal is to select the right analytical techniques and variables to address the business objectives.

Uploaded by

Neerom Baldemoro
Copyright
© All Rights Reserved

Introduction to Data Science
DATA ANALYTICS LIFE CYCLE
PART 2
Module Objectives
At the end of this module, students must be able to:
1. describe the processes involved in model planning, such as data
exploration and variable and model selection;
2. enumerate the key decisions needed to finalize the model as well as
the tools available for model building;
3. discuss the importance of communicating the results obtained to key
stakeholders;
4. describe the steps in operationalizing the results.
Recap:
From the previous discussion, we learned about:
1. an overview of the data analytics life cycle;
2. the seven key roles in an analytics project;
3. the discovery phase (Phase 1), where the data science team learns about
the business domain, assesses the available resources, and formulates
initial hypotheses to test in learning about the data;
4. the data preparation phase (Phase 2), which covers preparing the analytic
sandbox, performing ETLT, data conditioning, etc.
Phase 3 – Model Planning

• In Phase 3, the data science team identifies candidate models to apply
to the data for clustering, classifying, or finding relationships in the
data depending on the goal of the project, as shown.

• It is during this phase that the team refers to the hypotheses developed
in Phase 1, when they first became acquainted with the data and began to
understand the business problems or domain area.

• These hypotheses help the team frame the analytics to execute in Phase 4
and select the right methods to achieve its objectives.
*Text taken from Data Science and Big Data Analytics by EMC Education Services
Phase 3 – Model Planning
Some of the activities to consider in this phase include the following:
• Assess the structure of the datasets. The structure of the datasets is one
factor that dictates the tools and analytical techniques for the next phase.
Depending on whether the team plans to analyze textual data or
transactional data, for example, different tools and approaches are required.

• Ensure that the analytical techniques enable the team to meet the business
objectives and accept or reject the working hypotheses.

• Determine if the situation warrants a single model or a series of techniques
as part of a larger analytic workflow.
Phase 3 – Model Planning

• In addition to the considerations just listed, it is useful to research and
understand how other analysts generally approach a specific kind of
problem.

• Given the kind of data and resources that are available, evaluate whether
similar, existing approaches will work or if the team will need to create
something new. Many times teams can get ideas from analogous problems
that other people have solved in different industry verticals or domain areas.

Phase 3 – Model Planning
• Table 2-2 summarizes the results of an exercise of this type, involving
several domain areas and the types of models previously used in a
classification type of problem after conducting research on churn models
in multiple industry verticals.

• Performing this sort of diligence gives the team ideas of how others have
solved similar problems and presents the team with a list of candidate
models to try as part of the model planning phase.
Model Planning - Data Exploration and Variable Selection

• In Phase 3, the objective of the data exploration is to understand the
relationships among the variables to inform selection of the variables and
methods and to understand the problem domain. As with earlier phases of
the Data Analytics Lifecycle, it is important to spend time and focus
attention on this preparatory work to make the subsequent phases of model
selection and execution easier and more efficient.

• A common way to conduct this step involves using tools to perform data
visualizations. Approaching the data exploration in this way aids the team in
previewing the data and assessing relationships between variables at a high
level.

Model Planning - Data Exploration and Variable Selection

• As the team begins to question assumptions and test initial ideas of the
project sponsors and stakeholders, it needs to consider the inputs and data
that will be needed, and then it must examine whether these inputs are
actually correlated with the outcomes that the team plans to predict or
analyze.

• Some methods and types of models will handle correlated variables better
than others. Depending on what the team is attempting to solve, it may
need to consider an alternate method, reduce the number of data inputs, or
transform the inputs to allow the team to use the best method for a given
business problem.
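
One way to examine whether candidate inputs are correlated with the outcome is to compute a correlation coefficient for each input. The sketch below is only an illustration: the variable names (`monthly_calls`, `account_age`, `churn_score`) and the toy values are hypothetical and do not come from the source text, and Pearson correlation is one of several measures a team might choose.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical toy data: two candidate inputs and one outcome.
data = {
    "monthly_calls": [1, 2, 3, 4, 5, 6],
    "account_age": [5, 3, 6, 2, 7, 4],
    "churn_score": [2, 4, 6, 8, 10, 12],
}

outcome = "churn_score"
# Correlation of every candidate input with the outcome.
correlations = {
    name: pearson(values, data[outcome])
    for name, values in data.items()
    if name != outcome
}

# List inputs from the strongest to the weakest relationship.
for name, r in sorted(correlations.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: r = {r:+.2f}")
```

In this toy data, `monthly_calls` tracks the outcome closely while `account_age` barely does, so the latter would be an early candidate to drop or transform.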

Model Planning - Data Exploration and Variable Selection

• The key to this approach is to aim for capturing the most essential predictors and variables
rather than considering every possible variable that people think may influence the
outcome.

• Approaching the problem in this manner requires iterations and testing to identify the most
essential variables for the intended analyses. The team should plan to test a range of
variables to include in the model and then focus on the most important and influential
variables.

• If the team plans to run regression analyses, identify the candidate predictors and outcome
variables of the model. Plan to create variables that determine outcomes but demonstrate
a strong relationship to the outcome rather than to the other input variables. This includes
remaining vigilant for problems such as serial correlation, multicollinearity, and other typical
data modeling challenges that interfere with the validity of these models.
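
A simple first screen for multicollinearity is to look for pairs of input variables that are strongly correlated with each other. The sketch below assumes hypothetical inputs (`tenure_months`, `tenure_years`, `support_calls`) and an arbitrary cutoff of 0.9; neither comes from the source text. A pairwise screen like this is only a heuristic and does not replace a fuller diagnostic such as variance inflation factors.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical candidate inputs; two of them measure nearly the same thing.
inputs = {
    "tenure_months": [1, 2, 3, 4, 5, 6],
    "tenure_years": [0.1, 0.2, 0.2, 0.3, 0.4, 0.5],
    "support_calls": [3, 1, 4, 1, 5, 2],
}

THRESHOLD = 0.9  # assumed cutoff for "strongly correlated"

# Flag every pair of inputs whose absolute correlation exceeds the cutoff.
flagged = []
for a, b in combinations(inputs, 2):
    r = pearson(inputs[a], inputs[b])
    if abs(r) > THRESHOLD:
        flagged.append((a, b, r))

for a, b, r in flagged:
    print(f"{a} and {b} are strongly correlated (r = {r:.2f})")
```

For a flagged pair, the team would typically keep one of the two variables or combine them before fitting a regression.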
Model Planning – Model Selection
• In the model selection subphase, the team’s main goal is to choose an analytical
technique, or a short list of candidate techniques, based on the end goal of the
project.

• For the context of this book, a model is discussed in general terms. In this case, a
model simply refers to an abstraction from reality. One observes events
happening in a real-world situation or with live data and attempts to construct models
that emulate this behavior with a set of rules and conditions.

• In the case of machine learning and data mining, these rules and conditions are
grouped into several general sets of techniques, such as classification, association
rules, and clustering. When reviewing this list of types of potential models, the team
can winnow down the list to several viable models to try to address a given problem.
Model Planning – Model Selection
• An additional consideration in this area for dealing with Big Data involves
determining if the team will be using techniques that are best suited for
structured data, unstructured data, or a hybrid approach.

• Lastly, the team should take care to identify and document the modeling
assumptions it is making as it chooses and constructs preliminary models.

• Typically, teams create the initial models using a statistical software package
such as R, SAS, or Matlab. Although these tools are designed for data mining
and machine learning algorithms, they may have limitations when applying
the models to very large datasets, as is common with Big Data.

Phase 4 – Model Building

• In Phase 4, the data science team needs to develop datasets for training,
testing, and production purposes. These datasets enable the data scientist
to develop the analytical model and train it (“training data”), while holding
aside some of the data (“hold-out data” or “test data”) for testing the model.

• During this process, it is critical to ensure that the training and test datasets
are sufficiently robust for the model and analytical techniques. A simple way
to think of these datasets is to view the training dataset for conducting the
initial experiments and the test sets for validating an approach once the
initial experiments and models have been run.
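
Holding aside test data can be sketched as a random split of the available records. The function name `train_test_split`, the 25% hold-out fraction, and the fixed seed below are assumptions made for this illustration; the source text does not prescribe a particular split ratio.

```python
import random

def train_test_split(records, test_fraction=0.25, seed=42):
    """Shuffle the records and hold out a fraction of them as test data."""
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    shuffled = list(records)    # copy, so the caller's data is left untouched
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical dataset: 20 record identifiers.
records = list(range(20))
train, test = train_test_split(records)

print(f"{len(train)} training records, {len(test)} hold-out records")
```

The key property is that no record appears in both sets, so the test data stays unseen until the model is evaluated.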

Phase 4 – Model Building
• In the model building phase, as shown, an analytical model is developed
and fit on the training data and evaluated (scored) against the test data.

• The phases of model planning and model building can overlap quite a bit,
and in practice one can iterate back and forth between the two phases for
a while before settling on a final model.
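
The idea of fitting on training data and scoring against test data can be illustrated with a very small model. The sketch below uses ordinary least squares for a single predictor and root mean squared error (RMSE) as the score; both the model choice and the toy numbers are assumptions made for illustration, not something the source text specifies.

```python
from math import sqrt

def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    return slope, mean_y - slope * mean_x

def rmse(xs, ys, slope, intercept):
    """Root mean squared error of the fitted line on the given data."""
    errors = [(slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)]
    return sqrt(sum(errors) / len(errors))

# Hypothetical training data and hold-out test data.
train_x, train_y = [1, 2, 3, 4, 5, 6], [3.1, 4.9, 7.2, 8.8, 11.1, 12.9]
test_x, test_y = [7, 8], [15.0, 17.1]

slope, intercept = fit_line(train_x, train_y)           # fit on training data only
train_rmse = rmse(train_x, train_y, slope, intercept)
test_rmse = rmse(test_x, test_y, slope, intercept)      # score on unseen data

print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print(f"train RMSE={train_rmse:.3f}, test RMSE={test_rmse:.3f}")
```

A test error far above the training error would be one signal to return to model planning and revisit the variables or the technique.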
Phase 4 – Model Building

• Although the modeling techniques and logic required to develop models can
be highly complex, the actual duration of this phase can be short compared
to the time spent preparing the data and defining the approaches.

• In general, plan to spend more time preparing and learning the data
(Phases 1–2) and crafting a presentation of the findings (Phase 5). Phases
3 and 4 tend to move more quickly, although they are more complex from a
conceptual standpoint. As part of this phase, the data science team needs
to execute the models defined in Phase 3.

Phase 4 – Model Building

• During this phase, users run models from analytical software packages, such as R or SAS, on file
extracts and small datasets for testing purposes. In addition, we assess the validity of the model and its
results as well as determine if the model accounts for most of the data and has robust predictive power.

• Also, at this point, we refine the models to optimize the results, such as by modifying variable inputs or
reducing correlated variables where appropriate. In Phase 3, the team may have had some knowledge
of correlated variables or problematic data attributes, which will be confirmed or denied once the models
are actually executed.

• When immersed in the details of constructing models and transforming data, many small decisions are
often made about the data and the approach for the modeling. These details can be easily forgotten
once the project is completed. Therefore, it is vital to record the results and logic of the model during
this phase. In addition, one must take care to record any operating assumptions that were made in the
modeling process regarding the data or the context.

Phase 4 – Model Building
Creating robust models that are suitable to a specific situation requires
thoughtful consideration to ensure the models being developed ultimately
meet the objectives outlined in Phase 1. Questions to consider include these:

• Does the model appear valid and accurate on the test data?
• Does the model output/behavior make sense to the domain experts? That
is, does it appear as if the model is giving answers that make sense in this
context?
• Do the parameter values of the fitted model make sense in the context of
the domain?
• Is the model sufficiently accurate to meet the goal?
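
For a classification model, the questions about accuracy on the test data can be made concrete by counting correct and incorrect predictions. The sketch below assumes binary labels (1 = positive, 0 = negative), hypothetical predictions, and an accuracy goal of 0.75 chosen purely for illustration; in practice the goal comes from the success criteria set in Phase 1.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary labels (1 or 0)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == 1:
            counts["tp" if a == 1 else "fp"] += 1
        else:
            counts["fn" if a == 1 else "tn"] += 1
    return counts

# Hypothetical test-set labels and model predictions.
actual = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
predicted = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

counts = confusion_counts(actual, predicted)
accuracy = (counts["tp"] + counts["tn"]) / len(actual)

TARGET_ACCURACY = 0.75  # assumed project goal, not from the source text
meets_goal = accuracy >= TARGET_ACCURACY

print(counts)
print(f"accuracy = {accuracy:.2f}, meets goal: {meets_goal}")
```

Separating false positives from false negatives also matters because one kind of mistake may be far more costly than the other in a given domain.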
Phase 4 – Model Building

• Does the model avoid intolerable mistakes?
• Are more data or more inputs needed? Do any of the inputs need to be
transformed or eliminated?
• Will the kind of model chosen support the runtime requirements?
• Is a different form of the model required to address the business problem? If
so, go back to the model planning phase and revise the modeling approach.

• Once the data science team can evaluate either if the model is sufficiently
robust to solve the problem or if the team has failed, it can move to the next
phase in the Data Analytics Lifecycle.
Phase 4 – Model Building
There are many tools available to assist in this phase, focused primarily on
statistical analysis or data mining software. Common tools in this space
include, but are not limited to, the following:

Commercial Tools:
1. SAS Enterprise Miner
2. SPSS Modeler
3. Matlab
4. Alpine Miner
5. STATISTICA
6. Mathematica

Free or Open Source Tools:
1. R and PL/R
2. Octave
3. WEKA
4. Python
5. SQL

Phase 5 – Communicate the Results
• After executing the model, the team needs to compare the outcomes of the
modeling to the criteria established for success and failure.

• In Phase 5, as shown, the team considers how best to articulate the
findings and outcomes to the various team members and stakeholders,
taking into account caveats, assumptions, and any limitations of the
results.

• Because the presentation is often circulated within an organization, it is
critical to articulate the results properly and position the findings in a
way that is appropriate for the audience.

Phase 5 – Communicate the Results
• As part of Phase 5, the team needs to determine if it succeeded or failed in its objectives.
Many times people do not want to admit to failing, but in this instance failure should not be
considered as a true failure, but rather as a failure of the data to accept or reject a given
hypothesis adequately.

• This concept can be counterintuitive for those who have been told their whole careers not to
fail. However, the key is to remember that the team must be rigorous enough with the data to
determine whether it will prove or disprove the hypotheses outlined in Phase 1 (discovery).

• Sometimes teams have only done a superficial analysis, which is not robust enough to
accept or reject a hypothesis. Other times, teams perform very robust analysis and are
searching for ways to show results, even when results may not be there. It is important to
strike a balance between these two extremes when it comes to analyzing data and being
pragmatic in terms of showing real-world results.

Phase 5 – Communicate the Results
• When conducting this assessment, determine if the results are statistically
significant and valid. If they are, identify the aspects of the results that stand
out and may provide salient findings when it comes time to communicate them.

• If the results are not valid, think about adjustments that can be made to refine
and iterate on the model to make it valid. During this step, assess the results
and identify which data points may have been surprising and which were in line
with the hypotheses that were developed in Phase 1.

• Comparing the actual results to the ideas formulated early on produces
additional ideas and insights that would have been missed if the team had not
taken time to formulate initial hypotheses early in the process.
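
One way to check whether a result is statistically significant is a permutation test, sketched below under stated assumptions. The two groups (for example, a metric with and without a model-driven intervention) and their values are hypothetical, and the 0.05 significance level is a common convention rather than something the source text specifies. The test asks how often a random relabeling of the pooled data produces a difference in means at least as large as the one observed.

```python
from itertools import combinations

def exact_permutation_p_value(group_a, group_b):
    """Two-sided exact permutation test on the difference in group means."""
    pooled = group_a + group_b
    n_a, n_b = len(group_a), len(group_b)
    total = sum(pooled)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)

    extreme = count = 0
    # Enumerate every way of assigning n_a of the pooled values to group A.
    for indices in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in indices)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        count += 1
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
    return extreme / count

# Hypothetical measurements for two groups of six observations each.
with_model = [12, 14, 15, 13, 16, 15]
without_model = [9, 10, 8, 11, 9, 10]

p_value = exact_permutation_p_value(with_model, without_model)
ALPHA = 0.05  # conventional significance level, assumed here
significant = p_value < ALPHA

print(f"p-value = {p_value:.4f}, significant: {significant}")
```

Enumerating every relabeling is only practical for small samples; with larger data a team would sample random permutations or use a standard parametric test instead.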
Phase 5 – Communicate the Results
• By this time, the team should have determined which model or models
address the analytical challenge in the most appropriate way. In addition,
the team should have ideas of some of the findings as a result of the
project. The best practice in this phase is to record all the findings and then
select the three most significant ones that can be shared with the
stakeholders.

• In addition, the team needs to reflect on the implications of these findings
and measure the business value. Depending on what emerged as a result
of the model, the team may need to spend time quantifying the business
impact of the results to help prepare for the presentation and demonstrate
the value of the findings.
Phase 5 – Communicate the Results
• Now that the team has run the model, completed a thorough discovery
phase, and learned a great deal about the datasets, reflect on the project
and consider what obstacles were in the project and what can be improved
in the future.

• Make recommendations for future work or improvements to existing
processes, and consider what each of the team members and stakeholders
needs to fulfill her responsibilities. For instance, sponsors must champion
the project. Stakeholders must understand how the model affects their
processes.

Phase 5 – Communicate the Results
• For example, if the team has created a model to predict customer churn, the
Marketing team must understand how to use the churn model predictions in
planning their interventions.

• Production engineers need to operationalize the work that has been done.
In addition, this is the phase to underscore the business benefits of the work
and begin making the case to implement the logic into a live production
environment.

Phase 6 - Operationalize

• In the final phase, the team communicates the benefits of the project more broadly and sets
up a pilot project to deploy the work in a controlled way before broadening the work to a full
enterprise or ecosystem of users.

• Phase 6 represents the first time that most analytics teams approach deploying the new
analytical methods or models in a production environment. Rather than deploying these
models immediately on a wide-scale basis, the risk can be managed more effectively and the
team can learn by undertaking a small scope, pilot deployment before a wide-scale rollout.

• This approach enables the team to learn about the performance and related constraints of
the model in a production environment on a small scale and make adjustments before a full
deployment.

Phase 6 - Operationalize

• Be aware that this phase can bring in a new set of team members, usually
the engineers responsible for the production environment who have a new
set of issues and concerns beyond those of the core project team.

• This technical group needs to ensure that running the model fits smoothly
into the production environment and that the model can be integrated into
related business processes.

Phase 6 - Operationalize
• Part of the operationalizing phase includes creating a mechanism for
performing ongoing monitoring of model accuracy and, if accuracy
degrades, finding ways to retrain the model.

• If feasible, design alerts for when the model is operating “out-of-bounds.”
This includes situations when the inputs are beyond the range that the
model was trained on, which may cause the outputs of the model to be
inaccurate or invalid. If this begins to happen regularly, the model needs to
be retrained on new data.
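
An out-of-bounds alert can be sketched as a comparison between each incoming input and the range seen during training. The function names, the input fields (`age`, `monthly_spend`), and the toy records below are hypothetical; a production system would also need a rule for what rate of alerts counts as "regularly".

```python
def training_ranges(rows):
    """Record the minimum and maximum seen for each input during training."""
    names = rows[0].keys()
    return {
        name: (min(row[name] for row in rows), max(row[name] for row in rows))
        for name in names
    }

def out_of_bounds(row, ranges):
    """Return the names of inputs that fall outside the training range."""
    return [
        name for name, (low, high) in ranges.items()
        if not low <= row[name] <= high
    ]

# Hypothetical training data and incoming production records.
training_rows = [
    {"age": 25, "monthly_spend": 40.0},
    {"age": 41, "monthly_spend": 75.5},
    {"age": 63, "monthly_spend": 120.0},
]
incoming = [
    {"age": 30, "monthly_spend": 60.0},   # within the training range
    {"age": 19, "monthly_spend": 80.0},   # age below the training range
    {"age": 50, "monthly_spend": 300.0},  # spend above the training range
]

ranges = training_ranges(training_rows)
checked = [(i, out_of_bounds(row, ranges)) for i, row in enumerate(incoming)]
alerts = [(i, names) for i, names in checked if names]
out_of_bounds_rate = len(alerts) / len(incoming)

for i, names in alerts:
    print(f"record {i}: out-of-bounds inputs {names}")
print(f"out-of-bounds rate: {out_of_bounds_rate:.0%}")
```

Tracking the out-of-bounds rate over time gives the team a simple trigger for deciding when retraining on new data is due.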

Phase 6 - Operationalize
Although many roles represent many interests within a project, these interests
usually overlap, and most of them can be met with four main deliverables.
• Presentation for project sponsors: This contains high-level takeaways for
executive-level stakeholders, with a few key messages to aid their
decision-making process. Focus on clean, easy visuals for the presenter to
explain and for the viewer to grasp.
• Presentation for analysts: This describes business process changes and
reporting changes. Fellow data scientists will want the details and are
comfortable with technical graphs such as Receiver Operating Characteristic
(ROC) curves, density plots, and histograms.
• Code for technical people.
• Technical specifications for implementing the code.
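
For readers less familiar with the ROC curves mentioned in the analyst presentation, the sketch below shows how the points of such a curve and the area under it (AUC) can be computed. The labels and scores are hypothetical, and the code assumes all scores are distinct; tied scores would need extra handling that is omitted here.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold.

    Assumes binary labels (1 = positive, 0 = negative) and distinct scores.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), reverse=True)  # highest score first
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Area under the ROC curve using the trapezoidal rule."""
    return sum(
        (x2 - x1) * (y1 + y2) / 2
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    )

# Hypothetical true labels and model scores for six test records.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
auc = area_under_curve(points)

print(points)
print(f"AUC = {auc:.3f}")
```

An AUC of 1.0 means the scores rank every positive above every negative, while 0.5 is what random ranking would produce on average.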
Phase 6 - Operationalize

• As a general rule, the more executive the audience, the more succinct the
presentation needs to be. Most executive sponsors attend many briefings in
the course of a day or a week. Ensure that the presentation gets to the point
quickly and frames the results in terms of value to the sponsor’s
organization.

• When presenting to other audiences with more quantitative backgrounds,
focus more time on the methodology and findings. In these instances, the
team can be more expansive in describing the outcomes, methodology, and
analytical experiment with a peer group.
