
Manual for Data Science Projects

Victor Pereboom
CTO at Dutch Analytics

Sascha van Weerdenburg
Machine Learning Engineer at Dutch Analytics
This manual for AI projects within the government has been developed in cooperation with, and based on, the AI project of Ariea Vermeulen and Arie van Kersen of Rijkswaterstaat (the Ministry of Infrastructure and Water Management). In 2019, they started a project to research whether the combination of drone images and machine learning could lead to better inspection of bridges and infrastructure.

AUTHORS

Victor Pereboom
CTO at Dutch Analytics

Sascha van Weerdenburg
Machine Learning Engineer at Dutch Analytics

EDITORS

Victor Pereboom
CTO at Dutch Analytics

Sascha van Weerdenburg
Machine Learning Engineer at Dutch Analytics

Ariea Vermeulen
Rijkswaterstaat

Arie van Kersen
Rijkswaterstaat

Francien Horrevorts
Rijkswaterstaat, Fran&Vrij Communicatie

Koen Hartog
DSI

Marloes Pomp
DSI

ABOUT DUTCH ANALYTICS

The software platform of Dutch Analytics, called Xenia, supports enterprises in operationalizing data science models. Xenia also manages and monitors Artificial Intelligence (AI) algorithms after development and deployment. The idea behind the software originates from the experience that many companies struggle to operationalize the results of data science projects. Many algorithms are never put into use after the first proof-of-concept, which wastes time, effort and money. But more importantly: missed value. By using Xenia, companies can easily take the step from data science model to a successful, scalable end product with related business revenues.

More information about Dutch Analytics and Xenia at www.dutchanalytics.com

In the past five years, Data Science has proven itself as a domain with major social impact.

We see increasingly more applications appearing in our society: modern smartphones are equipped with image and speech recognition technology. Self-driving cars are no longer just part of fictional film scenarios. These developments are not only happening at the local level. Companies worldwide are investing in innovative technologies to develop data-driven solutions and new products.

However, only a small percentage of these initiatives grows into a fully fledged solution. That is surprising, considering how much effort is invested in making these projects successful. And it raises some follow-up questions.

What are the factors of a successful data science project? In which steps do you go from an idea to a good solution?

This white paper aims to provide more insight into the life cycle of data science projects and is based on experiences gained during the development of various complex data-driven solutions.

Contents

What is data science ............................................................ 04
Types of machine learning and deep learning ..................................... 04
Lifecycle of a data science project ............................................. 05
Focus on the best model: from business case to proof-of-concept ................ 06
  Step 1: a good business case
  Step 2: obtain the correct data
  Step 3: clean and explore the data
  Step 4: development and evaluation of models
Focus on continuity: from proof-of-concept to scalable end product ............. 15
  Step 5: from working proof-of-concept to implementation
  Step 6: managing models in operation
  Step 7: from model management to business case

What is Data Science?

So what exactly is data science? And how does it relate to AI, Machine Learning and Deep Learning? We have put all the answers together for you.

Data Science
Data science is an area in which data is examined for patterns and characteristics. It combines methods and techniques from mathematics, statistics and computer science. Visualization techniques are frequently used to make the data understandable. The focus is on understanding and using the data with the aim of obtaining insights which can further contribute to the organisation.

Artificial Intelligence
Artificial Intelligence is a discipline which enables computers to mimic human behaviour and intelligence.

Machine Learning
Machine Learning is the part of Artificial Intelligence which focuses on 'learning'. Algorithms and statistical models are designed to learn tasks autonomously from data, without being given explicit instructions upfront.

Deep Learning
Deep Learning is the part of Machine Learning that uses artificial neural networks. This type of model is inspired by the structure and function of the human brain.

[Figure: visualization of the difference between Artificial Intelligence (AI), Machine Learning, Deep Learning and data science]

Types of Machine Learning and Deep Learning

There are different types of Machine Learning approaches for different types of problems. The three best known are:

Supervised Learning
Supervised Learning is the most widely used type of Machine Learning in data science. When inspecting damage to concrete structures, for example, photos of previous cases are used for which it is known whether they show damage or not. Each of these photos has been given a label, 'damaged' or 'not damaged', which helps in the further classification.
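
As an illustration, a minimal sketch of supervised learning in Python. The feature values and labels are purely hypothetical stand-ins for photos that have been labelled 'damaged' or 'not damaged'.

    # Minimal supervised-learning sketch: the model learns from labelled examples.
    # The two numbers per photo are hypothetical features extracted from the image.
    from sklearn.linear_model import LogisticRegression

    X = [[0.8, 0.1], [0.7, 0.2], [0.1, 0.9], [0.2, 0.8]]      # one feature vector per photo
    y = ["damaged", "damaged", "not damaged", "not damaged"]  # known labels

    model = LogisticRegression().fit(X, y)                    # learn from the labelled examples
    print(model.predict([[0.75, 0.15]]))                      # classify a new, unseen photo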

Unsupervised Learning
In Unsupervised Learning no labels are used; instead, the model itself tries to discover relations in the data. This is mainly used for grouping (clustering) the examples in the data, for instance creating customer groups in which customers with similar characteristics end up in the same group. Beforehand it is unknown which groups of customers there are and which characteristics they share, but with enough data available the algorithm can distinguish the different groups.
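
A comparable sketch for Unsupervised Learning, with hypothetical customer data: no labels are given, and the algorithm groups similar customers by itself.

    # Minimal clustering sketch: group customers by age and yearly spend, without labels.
    import numpy as np
    from sklearn.cluster import KMeans

    customers = np.array([[25, 300], [27, 320], [60, 4000], [62, 4200], [40, 1500]])
    groups = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(customers)
    print(groups)   # a cluster index per customer; which customers belong together is learned from the data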

Reinforcement Learning
Finally, a Reinforcement Learning model learns on the basis of trial and error. By rewarding good choices and punishing bad ones, the model learns to recognize patterns. This technique is mainly used in mastering (computer) games (such as Go) and in robotics (a robot which learns to walk by falling and standing up again). This type of Machine Learning usually falls outside of data science, because 'learning' a task is the goal, rather than understanding and using the underlying data.

Life cycle of a Data Science project

Now that the different terms have been explained, we focus on data science projects themselves. What do you need to pay attention to in such projects, what do they require, and which best practices and learnings can we provide you? We start with a little more background about the life cycle of these projects. The life cycle of a data science project consists of two phases, which together contain 7 steps.

FOCUS ON THE BEST MODEL: FROM BUSINESS CASE TO PROOF-OF-CONCEPT

In this phase the focus is on the development of the best model for the specific business case. This is why the definition of a good business case is essential. The data science team then works towards a working prototype (proof-of-concept).

The first phase consists of 4 steps:
1. A good business case
2. Obtain the correct data
3. Clean and explore the data
4. Development and evaluation of models

FOCUS ON CONTINUITY: FROM PROOF-OF-CONCEPT TO SCALABLE END PRODUCT

In the second phase the focus is on continuity, and the working prototype is developed into an operational end product.

This phase consists of 3 steps:
5. From proof-of-concept to implementation
6. Manage models in operation
7. From model management to business case

Together, these 7 steps (in two phases) form the life cycle of a data science project.

How do you tackle this process effectively as an organisation? In the next chapters we detail the 7 steps of the life cycle and bring together everything you need to pay attention to. Every chapter closes with a short summary.

FOCUS ON THE BEST MODEL: FROM BUSINESS CASE TO PROOF-OF-CONCEPT

Step 1: A good business case

The ultimate value of a data science project depends on a clear business case. What do you want to achieve with the project? What is the added value for the organisation and how will the information from an algorithm eventually be used? Here are some guidelines for defining and testing the business case.

Involve end users

A business case is generally strong when it comes from practice and from end users / domain experts, as they are usually the people who have to use and rely on the information from the models. It is therefore important that they in particular understand the added value of the data science solution. Where does the need of the user lie and how can the end product create value for the user?

Does AI give the best solution for the problem?

After defining the business case, it is wise to assess whether AI, in the form of self-learning algorithms, is the best solution for the problem. AI is very suitable for finding patterns in data which is too large and complex for people to oversee. AI has proven itself valuable in a number of areas, such as image and speech recognition, and helps to automate these tasks. In some cases, however, AI is less useful:

• AI models learn from (large amounts of) data. If little relevant data is available, or if necessary contextual information is missing from the data, it is better not to use an AI solution.
• AI is not good in unpredictable situations which require creativity and intuition to solve.
• The same applies if transparency of the algorithm is of great importance, because for many models (in particular Deep Learning) it is difficult to understand why they show certain behaviour or give certain results.
• There may simply be more effective and/or cheaper solutions than data science for a specific business case, such as traditional software.

NAME STAKEHOLDERS
To make the data science project a success, three stakeholders are required:

THE MANAGER (someone from company policy/strategy)
The manager is responsible for checking the added value of the business case.

AN END-USER / EXECUTIVE EXPERT (someone from operations)
The executive expert is responsible for end-user acceptance. Without a clear point of reference and support for the end users, it is likely that a data science model will not be adopted in practice.

SOMEONE FROM THE DATA SCIENCE TEAM
The data scientist is responsible for assessing the project's probability of success, given the potential solution and the available data.

These three stakeholders must be aligned in advance about the objectives of the end product. When there is no internal data science team available, one can also choose to outsource the project. In that case, the manager is the coordination point between the end users/executive experts internally and the external data science team.

STEP 1 SUMMARIZED

• Define the business case together with the executive experts.
  º What is the need?
  º What are the requirements that the solution must meet to generate significant added value for the company?
• Assess whether applying AI is the best solution for the business case defined.
• Appoint the relevant stakeholders: from the perspective of policy and strategy (the manager), the implementation (the end user) and the data scientist.

Step 2: Obtain the correct data

Developing an AI algorithm often requires large amounts of high-quality data, since a flexible model will extract its intelligence from the information contained in the data. How do you obtain this good data and information?

Make sure that data is available

Before a data science team can start building a model, the correct data must be available. Data science teams depend on data engineers and database administrators, since these employees have the appropriate permissions to access the required data. Executive experts/end users do not always know how the data is stored in the databases. Finding the required data is therefore an interaction between the executive experts, the data science team and the data engineers and/or database administrators. It is also possible that the required data is not available within the organisation. In that situation one can choose to collect data from external sources, if available.

Data dump for the first phase

In the first phase of the data science lifecycle, a one-off collected data dump is usually sufficient. One part of it will be used to "train" the model and the other part to "validate" the model.

When multiple data science teams compete for the same business case, all teams should receive the same data set. This ensures that all teams have an equal chance of finding the best model, and the performance of the various models can be compared.

It is wise to set aside an extra 'test dataset' which has not been shared with the data science teams beforehand. Based on the predictions of each model on this test data set, it can then be determined which model performs best. A good example of this is how the platform Kaggle operates, which organises public competitions for data science teams on behalf of companies.
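
As an illustration, a minimal sketch of how such a data dump could be split; the file names and proportions are assumptions for this example.

    # Minimal sketch: split a one-off data dump into train, validation and a held-back test set.
    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("data_dump.csv")                                           # hypothetical one-off data dump
    train_val, test = train_test_split(df, test_size=0.2, random_state=42)      # test set is kept aside
    train, val = train_test_split(train_val, test_size=0.25, random_state=42)   # roughly 60/20/20 overall

    train.to_csv("train.csv", index=False)        # shared with the data science team(s)
    val.to_csv("validation.csv", index=False)     # used to compare models during development
    test.to_csv("test.csv", index=False)          # withheld until the final comparison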

Automatically retrieving data in the second phase

In the second phase of the data science lifecycle, once a model becomes operational, a one-off data dump is no longer sufficient. The data must then reach the model automatically for computation. In practice, this is often rather complicated, because there are so-called data silos: closed databases which are difficult to integrate into an application. This is because many systems are not designed to communicate easily with each other, and internal IT security measures can make communication more difficult. That is why it is recommended to think about the second phase already during the first phase.

Start setting up infrastructure in time

Investing in a good infrastructure for the storage and exchange of data is essential for a data science project to become a success. A data engineer can facilitate this process, in which a robust infrastructure is set up. Start with this process early and keep security, access rights and protection of personal data in mind.

STEP 2 SUMMARIZED

• Make sure all data is available to the data science teams, in cooperation with the data engineers.
  º For the first phase, a single data dump will be sufficient.
  º For the second phase, the data has to be retrieved automatically and fed into the model.
• Start organizing a good data infrastructure and the associated access rights in time.

Step 3: Clean and explore the data

When the dataset is available, the data science team can start developing the solution. An important step is to first clean and then explore the obtained data. The data must meet a number of requirements to be suitable for use in developing AI models: it must be representative and of good quality.

REPRESENTATIVE DATA

It is important that the data used for developing models represents reality as closely as possible. A self-learning algorithm gets smarter by learning from examples in the given data. Requirements for the data are the diversity and completeness of the data points. Furthermore, for many business cases the data must be up to date, because old data might no longer be relevant for the current situation. Be aware that no unintended bias is included in the data provided.

EXAMPLE

When developing a model for classifying images, one must take diversity in the images into account. An example of this is searching for damage in concrete through images. Imagine all photos with damage were taken on a cloudy day and all photos without damage on a sunny day. The data science model might then base its choice on the background colours. A new image of concrete damage on a sunny day could therefore be classified incorrectly as undamaged.

Quality of the data

Besides the variety and completeness of the data, its quality is also of great importance. A good data structure supports this: the data should be as complete, consistent and unambiguous as possible. It is also essential to prevent human (input) errors as much as possible. Mandatory fields, checks and categories instead of free-text boxes can help when the data is entered.
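
As an illustration, a minimal sketch of basic data-quality checks; the file and column names are assumptions.

    # Minimal sketch of data-quality checks before modelling starts.
    import pandas as pd

    df = pd.read_csv("inspections.csv")                    # hypothetical dataset

    print(df.isna().sum())                                 # missing values per column
    print(df.duplicated().sum())                           # number of duplicate rows
    print(df["label"].value_counts())                      # class balance: damaged vs. not damaged
    print(df["inspection_date"].min(), df["inspection_date"].max())   # how recent is the data?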

Confidence
Representativeness and quality of the data have a positive impact on the confidence in the project and the end solution. This confidence also contributes to acceptance and usage by the end users/executive experts.

Short feedback cycle

When exploring the data, it is crucial that the data is interpreted correctly. This is done by asking the executive experts for feedback. A short feedback cycle between the data science team and the executive experts is required for this. It can take the form of, for instance, a presentation of the findings to the executive experts and/or the manager every few weeks.

STEP 3 SUMMARIZED

• Make sure that the data is representative. The data must be complete, diverse, recent and free from unintended bias. Make sure that important context which may affect the prediction is present in the data. Use the knowledge of the executive experts for this.
• Provide high-quality data. Prevent errors in the dataset by catching them as early as possible when the data is entered. Evaluate whether the data is complete and consistent.
• Create trust with all stakeholders.
• Provide a short feedback cycle for the right interpretation of the data.

Step 4: Development and evaluation of models

At this stage, the data scientist has a lot of freedom to explore. The goal is to create as much business value as possible, as quickly as possible. As a result, no data science solution is exactly the same and data science has a strong experimental character. What do you have to think of here and what does experimenting involve? On the one hand it involves the type of algorithm and its parameters, on the other hand the variables constructed from the data (the features). Depending on the problem and the available data, different categories of AI models can be used. During the development and evaluation of the models, there are a couple of points which you need to pay attention to.

Labels of the dataset

The most commonly used models are in the category of Supervised Learning, where the model learns from a set of data with associated known annotations or labels. Consider, for example, the aforementioned recognition of concrete damage. The model is trained on a set of photos of concrete structures for which it is known whether they are damaged. Once trained, the model can be used to classify new photos.

Unfortunately, a data set for which these types of labels (in this case 'damage/no damage present') are known is not always available. Then it is necessary to create these labels. This task is referred to as "labelling" or "annotating" data and is often largely manual work. For the example of the concrete damage, this means that someone manually goes through about 2000 photos and indicates whether damage is visible. There are also methods to speed up this process, for example by only presenting a domain expert with the photos about which the Machine Learning model is most uncertain. Many models indicate themselves with what certainty a classification was made.

Evaluation of the model

Labels are important for developing the model as well as for assessing it. It is possible that the actual value will automatically appear in the data once it becomes known. In that case, a direct comparison between the actual value and the model prediction can be made. If the actual value does not automatically appear in the data, as in the example of the concrete damage classification, a feedback loop can be built into the end solution. The user is then asked to provide feedback about the correctness of the classification. This information is important for monitoring the quality of the data science model.

Optimization of the algorithm

An algorithm must be optimized and tuned to the problem and the related data. By adjusting hyperparameters, "the model's rules", the algorithm is adapted to the application. This optimization step often takes the form of a grid search, in which a large set of different values is tried and the best are chosen.

Choosing the evaluation method

In order to determine the best model, an evaluation method must be defined which reflects the purpose of the data science model. For example, in the medical field it is important that extreme errors of the data science model do not occur, while in other fields extreme errors may have been caused by extreme measuring points in the data, which people choose to give less weight to. For classification models this includes the balance between inclusivity, or recall (finding all concrete damage), and precision (finding only concrete damage). The assessment method must correspond to the business case and to how the algorithm will be used in practice.
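
As an illustration, a minimal sketch that combines both points: a grid search over hyperparameters, followed by an evaluation on precision and recall. The data here is synthetic; the parameter values and the choice to optimise for recall are assumptions that would in practice follow from the business case.

    # Minimal sketch: tune hyperparameters with a grid search and evaluate with a
    # metric that matches the business case (here: recall, i.e. find all damage).
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import precision_score, recall_score
    from sklearn.model_selection import GridSearchCV, train_test_split

    X, y = make_classification(n_samples=500, random_state=0)   # stand-in for real features and labels
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    search = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid={"n_estimators": [50, 200], "max_depth": [5, 10, None]},
        scoring="recall",                                        # optimise for finding all damage
        cv=5,
    )
    search.fit(X_train, y_train)

    pred = search.best_estimator_.predict(X_test)
    print("precision:", precision_score(y_test, pred))           # of the reported damage, how much is real?
    print("recall:", recall_score(y_test, pred))                 # of the real damage, how much is found?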

Monitoring the models

Monitoring the quality of data science models is very important, because of the model's dependency on the data.

Data can change over time, as a result of which a data science model may also show lower performance over time. As soon as the outside world changes, the data changes with it. For example, the seasons can influence photos and, consequently, the results of the model. It can be useful to add an explanation to the prediction of a data science model, so that the end user/executive expert understands what the prediction is based on. This provides more insight and transparency into how the model reasons.
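
One possible, minimal way to monitor such changes is to compare the distribution of an input feature in recent production data with the data the model was trained on; the feature and file names below are assumptions.

    # Minimal sketch of a data-drift check on one input feature.
    import numpy as np
    from scipy.stats import ks_2samp

    train_values = np.load("train_brightness.npy")     # feature values seen during training
    recent_values = np.load("recent_brightness.npy")   # the same feature from recent live inputs

    stat, p_value = ks_2samp(train_values, recent_values)
    if p_value < 0.01:
        print("Feature distribution has shifted; model performance may be degrading.")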

Explainability of the predictions

Making a data science model mainly consists of iterations: running and evaluating. Here too, the feedback from users is important. Do the predictions of the model make sense? Are essential variables missing which could have an impact? Do the identified relationships also reflect a causal relationship? The transparency of the algorithm plays an important role in answering these questions. Some types of algorithms are very opaque, so that the results cannot be traced back to the input data. This is the case for neural networks, among others. More linear methods or traditional statistical approaches are often easier to interpret. The demand for transparent algorithms can be seen in recent developments regarding Explainable AI. When choosing the model, take the practical requirements of transparency into account with regard to the explainability of the predictions.

Prototype finished?
Is the prototype finished? For example, is there an interface available which shows the results? As soon as the prototype generates value, every step of the first phase has been completed and the second phase begins: the operationalization.

STEP 4 SUMMARIZED

• Make sure the data set is labelled.
• Make sure that the assessment of the model is part of the end solution.
  º Can the actual value be obtained automatically from the data?
  º Can the end user/executive expert give feedback?
• Is the algorithm evaluated with the correct assessment method?
  º Is the assessment method consistent with the problem definition and the data used?
• Make sure that the quality of the model is monitored, as well as changes in the data. For instance, add an explanation to the predictions of the model, so that the executive expert/end user can understand the predictions.
• Make sure that the data science solution is transparent. Discuss the performance of the model with the executive expert/end user.
  º Do the predictions make sense?
  º Does the model meet expectations?
  º How can the predictions be used by the executive expert/end user?
• If the prototype generates value, the process can proceed to the next step: operationalization.

FOCUS ON CONTINUITY: FROM PROOF-OF-CONCEPT TO A STABLE AND SCALABLE END SOLUTION

We have reached the second phase of the life cycle of a data science project: the operationalization. In this phase a stable and scalable end solution is developed from the proof-of-concept. This process consists of three steps.

Step 5: From successful proof-of-concept to implementation

In step 5, the model and the organization are prepared so that the model can be used in practice. We have put all the important elements of this step together for you.

Production-worthy code

Data science and software development are two surprisingly different worlds. Where data scientists focus on devising, developing and conducting experiments with many iterations, software developers are focused on building stable, robust and scalable solutions. These are sometimes difficult to combine.

In the first phase the focus is on quickly realizing the best possible prototype. It is about the performance of the model and creating great business value as quickly as possible. That is why there is often little time and attention for the quality of the programming code. In the second phase the quality of the code becomes important as well. This concerns the handling of possible future errors, clear documentation and the efficiency of the implementation (speed of the code).

That is why in practice the second phase usually starts with restructuring the code. The final model is structured and detached from all the experiments that have taken place during the previous phase.
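
As an illustration of what "restructured" can mean in practice, a minimal sketch of a single, documented prediction function with input checks, error handling and logging; the model file and feature setup are assumptions.

    # Minimal sketch of production-worthy code: one clear entry point, with logging
    # and error handling, detached from the experiments of the first phase.
    import logging
    import joblib

    logger = logging.getLogger(__name__)
    _model = joblib.load("model.pkl")          # the final, restructured model (hypothetical file)

    def predict_damage(features):
        """Classify one photo, given as a feature vector, as 'damaged' or 'not damaged'."""
        if len(features) != _model.n_features_in_:
            raise ValueError(f"expected {_model.n_features_in_} features, got {len(features)}")
        try:
            label = _model.predict([features])[0]
            logger.info("prediction made: %s", label)
            return label
        except Exception:
            logger.exception("prediction failed")   # a clear log entry simplifies troubleshooting
            raise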

Integration into existing processes

The restructured prototype must be integrated into the existing processes in the organisation. This mainly consists of two steps.

Automation of data flows
The retrieval of data and the writing of results back to a database must be done automatically by means of queries. This can be a complex process, which can be facilitated by a data engineer.

Setting up an infrastructure for hosting and managing the model
The model must be available to all end users/executive experts and it should scale with its use.

Requirements for the production environment

There are different requirements which you can set for hosting the model in the production environment.

• When a model is to be used intensively, scalability becomes relevant. This means that a model is started in parallel several times, so that all requests are distributed evenly across the model instances and can be processed faster. When the number of requests for a model varies greatly over time, it is helpful if the production environment facilitates automatic scaling. This can be compared to a supermarket where several cash registers are opened when it gets busier and closed again when things get quieter. The solution is then also ready for the future: the model can be used more intensively without additional infrastructure investment.
• Another important condition for the production environment is the availability of the solution. The end users/executive experts must always be able to use the model. Think of automatic restarts if systems fail. Automatic backups can also be part of this.
• A third requirement is authentication and security. The production environment, the model and the data must be accessible only to the people and/or systems which are allowed to view and use them. This requires a safe and reliable login system.
• Transparency and auditing in the production environment is also a point of attention. You want to know exactly who changed what and when. Changes must be traceable, so that it can be traced back to where things went wrong. Thanks to good logging, you can quickly find out exactly what happened or went wrong, which makes troubleshooting easier and faster. Link the auditing to the monitoring of the models, so that it can be determined whether a change in performance is related to an update of the model.

Management of the production environment

When the model produces error messages, they must be corrected. It is important to clearly define who is responsible for the models which run in the production environment. This can be the IT department or the data science team itself. In any case, it helps if the code is well structured and error messages are described in a clear log.

Dealing with multiple stakeholders: ownership of data and code

It often happens that an organisation cooperates with an external party for the development of data science models, or for the delivery of data. Data handling and data ownership are serious topics, and many organisations attach great importance to protecting their data and controlling who has access to it. In addition, this may also be required by legislation, such as the GDPR. The same applies to the model code of a data scientist.

Within a collaboration between parties, discussions may arise about who owns the data and/or who owns the algorithm. A neutral model-hosting platform can offer a solution for working with multiple parties. The data scientist can reach this platform with his model code and the data supplier with his data, and proper access rights can be arranged on the neutral platform. When working with such a platform, one can test the data scientist's model without having access to its source code, for example by creating a test data set and comparing several model variants on it.

The legal aspects of an AI project

There are several legal focus points when it comes to an AI project, so get in touch with an internal or external legal expert in the early stages.

Several points which you need to pay attention to:

■ Make sure you have an understanding of the labelled data, so you are not dependent on one of the developers.
■ Fix specific quality norms for the development of AI models (supervised learning).
■ Intellectual Property (IP): since the added value of a model results from the adjustment of its parameters, explicit agreements must be made about who holds the IP rights to them.

Law firm Pels Rijcken, in cooperation with the Amsterdam City Council, developed general terms and conditions for the purchase of AI applications, specifically for applications used in making decisions about citizens. Although purchasing conditions depend on the exact case, parts of these conditions might be reused.

Acceptance
Enthusiasm among users for the new solution does not always come naturally. This is often related to the idea that an algorithm poses a future threat to the employee's role in the organisation. Yet there is still a sharp contrast between where the strength of people lies and that of an algorithm. Where computers and algorithms are good at monitoring a continuous incoming flow of data, people are much better at understanding the context in which the outcome of an algorithm must be placed. This is a reason to pursue, in the first instance, solutions that focus as much as possible on human-algorithm collaboration. The algorithm can filter and enrich the total flow of data with discovered associations or properties, and the expert/technician can spend time interpreting the results and making the ultimate decisions.

Liability
How the model is used in practice partly determines who is responsible for the decisions which are made. When the data science model is used as a filter, one will have to think about what happens if the model makes a mistake in filtering. What are the effects of this and who is held responsible? And what about an autonomous data science solution, such as a self-driving car: who can be held accountable if the car causes an accident? These are often tough ethical questions, but it is important to consider them prior to integrating a model into a business process.

STEP 5 SUMMARIZED

• Make sure that the quality of the code is good.
• Integrate the model into the existing organizational processes.
  º Automate data streams.
  º Make sure that an infrastructure is available for hosting and model management.
• Pay attention to the requirements of the production environment: scalability, availability, security, and transparency & auditing.
• Specify who is responsible for data science model management.
• A neutral platform can help protect IP/ownership of data and model code.
• End users/executive experts must participate in the project.
  º Encourage human-algorithm collaboration, making the expert's job more challenging without threatening job loss.
  º Help end users/executive experts use the solution and value the benefits of their contribution.
• Specify who is responsible for errors that arise from incorrect predictions of the data science model.
• Discuss the legal aspects of the project with a legal expert at an early stage.

Step 6: Managing models in operation

A model which runs and is used in a production environment must be checked frequently. It is necessary to agree who is responsible for this and who executes these frequent checks. The data science team, or the end user/executive expert, can assess whether the model continues to work well and remains operational.

It may be necessary to regularly 'retrain' a Machine Learning model with new data. This can be a manual task, but it can also be built into the end solution. In the latter case, monitoring the performance of the model is essential. Which model performance and code belong to which set of training data has to be stored in a structured way, so that changes in the performance of a model can be traced back to the data. This is called 'data lineage'. More and more tooling is coming onto the market for this; an example is Data Version Control (DVC).
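
A minimal sketch of what such structured bookkeeping could look like without dedicated tooling; the file names, version numbers and metrics below are assumptions, and tools such as DVC automate this in a more robust way.

    # Minimal sketch of 'data lineage' bookkeeping: record which training data,
    # model version and performance belong together after each (re)training run.
    import datetime
    import hashlib
    import json

    def log_training_run(data_path, model_version, metrics):
        with open(data_path, "rb") as f:
            data_hash = hashlib.sha256(f.read()).hexdigest()   # fingerprint of the training data
        record = {
            "timestamp": datetime.datetime.now().isoformat(),
            "model_version": model_version,
            "training_data": data_path,
            "training_data_sha256": data_hash,
            "metrics": metrics,
        }
        with open("training_log.jsonl", "a") as f:
            f.write(json.dumps(record) + "\n")

    log_training_run("train_2024_06.csv", "v1.3.0", {"recall": 0.91, "precision": 0.88})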

STEP 6 SUMMARIZED

• Agree clearly who will continue to monitor the model over time once it is running in the operational environment.
• When a model is frequently 'retrained' on new data, it is important to consistently record when this happened, which version is used and which performance belongs to which training dataset. This is required for the traceability of model performance.

Step 7: From model management to business case

By frequently checking the performance of a data science model, any degrading model output can be discovered in good time. Degradation occurs due to the strong dependence between the code of the algorithm, the data used to train the algorithm and the continuous flow of new data, influenced by various external factors. It may happen that environmental factors change, and as a result certain assumptions are no longer correct, or that new variables are measured which were previously unavailable.

The development of the algorithm therefore continues in the background. This creates newer model versions for the same business case, with software updates as a result. In practice, several model versions run in parallel for a while, so that the differences become transparent. Each model has its own dependencies that must be taken into account.

When multiple models and projects coexist in the same production environment, it is of paramount importance that all projects remain transparent and controllable. Standardisation is required to prevent fragmentation, e.g. a fixed location (all in the same environment) and the same code and data structure.

Ultimately, data science projects form a closed loop and improvements are always possible. The ever-changing code, variables and data make data science products technically complex on the one hand, but on the other hand very flexible and particularly powerful when properly applied.
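
A minimal sketch of running two model versions in parallel so their outputs can be compared before switching over; the model files and the logging target are assumptions.

    # Minimal sketch of a parallel ("shadow") run: the current model answers the
    # request, the candidate model is only logged so the two can be compared later.
    import joblib

    current = joblib.load("model_v1.pkl")      # version in use today (hypothetical files)
    candidate = joblib.load("model_v2.pkl")    # newer version under evaluation

    def predict_with_shadow(features):
        served = current.predict([features])[0]      # answer actually returned to the user
        shadow = candidate.predict([features])[0]    # recorded only, for comparison
        with open("shadow_log.csv", "a") as f:
            f.write(f"{served},{shadow}\n")
        return served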

STEP 7 SUMMARIZED

• If possible, run a new model version in parallel with the previous model version for a while so that they can be compared.
• Create standards and make them requirements for new data science projects.
  º Fixed location (all in the same environment).
  º Consistent code structure (if developed internally).
  º Uniform data structure.
