Manual for Data Science Projects
AUTHORS & EDITORS
Ariea Vermeulen, Rijkswaterstaat
Arie van Kersen, Rijkswaterstaat
Francien Horrevorts, Rijkswaterstaat, Fran&Vrij Communicatie
Koen Hartog, DSI
Marloes Pomp, DSI

ABOUT DUTCH ANALYTICS
The software platform of Dutch Analytics, called Xenia, supports enterprises in operationalizing data science models. Xenia also manages and monitors Artificial Intelligence (AI) algorithms after development and deployment. The idea behind the software originates from the experience that many companies struggle to operationalize the results of data science projects. Many algorithms are never put into use after the first proof-of-concept, which wastes time, effort and money but, more importantly, means missed value. By using Xenia, companies can easily take the step from data science model to a successful, scalable end product with related business revenues.
In the past five years, Data Science has proven itself as a domain with major social impact.
We see more and more applications appearing in our society: modern smartphones are equipped with image and speech recognition technology, and self-driving cars are no longer confined to fictional film scenarios. These developments are not happening only at the local level: companies worldwide are investing in innovative technologies to develop data-driven solutions and new products.
However, only a small percentage of these initiatives grows into a fully-fledged solution. That is surprising, given how much effort is invested in making these projects successful, and it raises some follow-up questions.
Machine Learning

Machine Learning is the part of Artificial Intelligence that focuses on 'learning': algorithms and statistical models are designed to learn tasks autonomously from data, without being given explicit instructions upfront.
There are different types of Machine Learning approaches for different types of problems. The three best-known are:
Supervised Learning
Supervised Learning is the most widely used type of Machine Learning in data science. When inspecting damage to concrete, for example, photos of previous cases are used for which it is known whether they show damage or not. Each of these photos has been given a label, 'damaged' or 'not damaged', which the model uses to learn the classification.
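To make this concrete, here is a minimal supervised-learning sketch in Python. It assumes the photos have already been converted into numeric feature vectors; the data below is randomly generated and purely illustrative, not taken from a real inspection project.

# Minimal supervised-learning sketch with scikit-learn.
# X and y below are hypothetical stand-ins for real inspection data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))       # 200 photos, 16 features each
y = rng.integers(0, 2, size=200)     # labels: 1 = 'damaged', 0 = 'not damaged'

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)  # learn from labelled examples
print("accuracy:", model.score(X_test, y_test))     # evaluate on unseen photos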
Unsupervised Learning
In Unsupervised Learning no labels are used; instead, the model itself tries to discover relations in the data. This is mainly used for grouping (clustering) the examples in the data, for instance creating customer groups in which customers with similar characteristics end up together. Beforehand it is unknown which customer groups exist and which characteristics they share, but with enough data available the algorithm can distinguish the different groups.
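As a minimal illustration of clustering, the following Python sketch groups fictitious customers by two characteristics; the features and group structure are invented for the example.

# Minimal unsupervised-learning sketch: grouping customers without labels.
# The two features (e.g. age, yearly spend) are hypothetical illustrations.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
customers = np.vstack([
    rng.normal([30, 200], 10, size=(50, 2)),  # one hidden customer group
    rng.normal([55, 800], 10, size=(50, 2)),  # another hidden group
])

# The algorithm discovers the groups itself; we only choose how many to look for.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=1).fit(customers)
print(kmeans.labels_[:10])  # cluster assignment per customer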
Reinforcement Learning
Finally, a Reinforcement Learning model learns on the basis of trial and error. By rewarding good choices and punishing bad ones, the model learns to recognize patterns. This technique is mainly used in (computer) games (such as Go) and in robotics (a robot which learns to walk by falling and standing up again). This type of Machine Learning usually falls outside of data science, because learning the task is the goal in itself, rather than understanding and using the underlying data.
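For completeness, a minimal Q-learning sketch in Python: an agent in a toy five-state corridor learns, purely through trial, error and rewards, to walk to the right end. The environment and parameters are illustrative, not from the manual.

# Minimal reinforcement-learning sketch: tabular Q-learning on a toy
# 1-D corridor where reaching the right end is rewarded.
import numpy as np

n_states, n_actions = 5, 2          # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.5, 0.9             # learning rate, discount factor

rng = np.random.default_rng(2)
for _ in range(500):                # episodes of trial and error
    s = 0
    while s != n_states - 1:
        a = int(rng.integers(n_actions))  # explore randomly; Q-learning learns off-policy
        s_next = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s_next == n_states - 1 else 0.0   # reward good choices
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q.argmax(axis=1))  # learned policy per state: 1 = move right (last state is terminal)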
Life cycle of a Data Science project

Now that the different terms have been explained, we focus on data science projects themselves. What do you need to pay attention to in such projects, what do they require, and which best practices and lessons can we offer you? We start with a little more background about the life cycle of these projects. The life cycle of a data science project consists of two phases, which together contain 7 steps.

FOCUS ON THE BEST MODEL: FROM BUSINESS CASE TO PROOF-OF-CONCEPT

In this phase the focus is on the development of the best model for the specific business case. This is why the definition of a good business case is essential. The data science team will then work towards a working prototype (proof-of-concept).

The first phase consists of 4 steps:
1. A good business case
2. Obtain the correct data
3. Clean and explore the data
4. Development and evaluation of models

FOCUS ON CONTINUITY: FROM PROOF-OF-CONCEPT TO SCALABLE END PRODUCT
In this second phase, the end-users/executive experts start working with the solution: they build on the business case and rely on the information from the models. It is therefore important that they especially understand the added value of the data science solution.

Step 1: A good business case

The ultimate value of a data science project depends on a clear business case. What do you want to achieve with the project? What is the added value for the organisation, and how will the information from an algorithm eventually be used? Here are some guidelines for defining and testing the business case. Where does the need of the user lie, and how can the end product create value for the user?

Does AI give the best solution for the problem?
After defining the business case, it is wise to assess whether AI, in the form of self-learning algorithms, is the best solution for the problem. AI is very suitable for finding patterns in data that is too large and complex for people to oversee. AI has proven itself valuable in a number of areas, such as image and speech recognition, and helps to automate these tasks. In some cases, however, AI is less useful:

• AI models learn from (large amounts of) data. If little relevant data is available, or if necessary contextual information is missing from the data, it is better not to use an AI solution.
• AI is not good at unpredictable situations which require creativity and intuition to solve.
NAME STAKEHOLDERS

To make the data science project a success, three stakeholders are required:

AN END-USER / EXECUTIVE EXPERT (someone from operations)
The executive expert is responsible for end-user acceptance. Without a clear point of reference and support for the end-users, it is likely that a data science model will not be adopted in practice.

THE MANAGER (someone from company policy/strategy)
The manager is responsible for checking the added value of the business case.

SOMEONE FROM THE DATA SCIENCE TEAM
The data scientist is responsible for assessing the project's probability of success, given the potential solution and available data.

These three stakeholders must be aligned in advance about the objectives of the end product. When there is no internal data science team available, one can also choose to outsource the project. In that case, the manager is the coordination point between the end-users/executive experts internally and the external data science team.
STEP 1 SUMMARIZED
Step 2: Obtain the correct data

Developing an AI algorithm often requires large amounts of high-quality data, since a flexible model will extract its intelligence from the information contained in the data. How do you obtain this good data and information?

First of all, the data must be available. Data science teams depend on data engineers and database administrators, since these employees have the appropriate permissions to access the required data. Executive experts/end-users do not always know how the data is stored in the databases. Finding the required data is therefore an interaction between the executive experts, the data science team, and the data engineers and/or database administrators. It is also possible that the required data is not available within the organisation. In that situation one can choose to collect data from external sources, if available.

Data dump for the first phase

In the first phase of the data science life cycle, a one-off collected data dump is usually sufficient. One part of this dump is used to "train" the model and the other part to "validate" the model.
When multiple data science teams compete for the same business case, all teams should receive the same data set. This ensures that all teams have an equal chance of finding the best model, and that the performance of the various models can be compared.

It is wise to set aside an extra 'test dataset' which is not shared with the data science teams beforehand. Based on each model's predictions on this test dataset, it can then be determined which model performs best. A good example of this is how the platform Kaggle operates, which organises public competitions for data science teams on behalf of companies.
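The following Python sketch shows one way to make such splits from a one-off data dump: a held-back test set first, and then a train/validation split. The sizes and data are illustrative.

# Sketch of splitting a one-off data dump into train, validation and a
# held-back test set, as described above.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 8))
y = rng.integers(0, 2, size=1000)

# First reserve a test set that the competing teams never see.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=3)
# Teams split the remainder into training and validation data themselves.
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=3)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200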
STEP 2 SUMMARIZED
• Make sure all data is available to data science teams in cooperation with data engineers.
º For the first phase, a single data dump will be sufficient.
º For the second phase, the data have to be retrieved automatically and fed into the
model.
• Start organising a good data infrastructure and the associated access rights in good time.
EXAMPLE: REPRESENTATIVE DATA
When developing a model for classifying images, one must take diversity in the images into account. An example of this is searching for damage in concrete through images. Imagine that all photos with damage were taken on a cloudy day and all photos without damage on a sunny day. The data science model might then base its choice on the background colours, and a new image of concrete damage on a sunny day could therefore be classified incorrectly as undamaged.
Confidence
Representativeness and the quality of the data have a positive impact on
the confidence in the project and the end solution. This confidence also
contributes to the acceptance and usage by end users/executive experts.
STEP 3 SUMMARIZED
• Make sure that the data is representative. The data must be complete, diverse, recent and free from unintended bias. Make sure that important context which may affect the prediction is present in the data. Use the knowledge of the executive experts for this.
• Provide high-quality data. Prevent errors in the dataset by catching them as early as possible when the data is entered. Evaluate whether the data is complete and consistent.
• Create trust with all stakeholders.
• Provide a short feedback cycle for the right interpretation of the data.
Data can change over time, as a result of which a data science model may also show lower performance over time. As soon as the outside world changes, the data changes with it: the seasons, for example, can influence photos and consequently the results of the model. It can be useful to add an explanation to the prediction of a data science model, so that the end user/executive expert understands what the prediction is based on. This provides more insight into and transparency about how the model reasons.
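One simple way to give such an explanation is to report which features weigh most heavily in the model. A Python sketch with a tree ensemble, in which the feature names are hypothetical:

# Sketch of attaching a simple explanation to a model: report which
# features contributed most. Feature names are invented for the example.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(4)
feature_names = ["brightness", "contrast", "crack_length", "edge_count"]
X = rng.normal(size=(300, 4))
y = (X[:, 2] > 0).astype(int)   # damage driven mostly by 'crack_length'

model = RandomForestClassifier(random_state=4).fit(X, y)
for name, weight in sorted(zip(feature_names, model.feature_importances_),
                           key=lambda p: -p[1]):
    print(f"{name}: {weight:.2f}")  # 'crack_length' should dominate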
Prototype finished?
Is the prototype finished? For example, is there an interface available which
shows the results? As soon as the prototype generates value, every step of the first phase has been completed and the second phase begins: the operationalization.
STEP 4 SUMMARIZED
We have reached the second phase of the life cycle of a data science project, the operationalization. In this phase a stable and
scalable end solution will be developed from the proof-of-concept. This process consists of three steps.
Step 5: From successful proof-of-concept to implementation

In step 5, the model and the organization are prepared to use the model in practice. We have put all important elements of this step together for you.

Production-worthy code

Data science and software development are two surprisingly different worlds. Where data scientists focus on devising, developing and conducting experiments in many iterations, software developers focus on building stable, robust and scalable solutions. The two are sometimes difficult to combine.

In the first phase the focus is on quickly realizing the best possible prototype. It is about the performance of the model and creating business value as quickly as possible, which is why there is often little time and attention for the quality of the programming code. In the second phase the quality of the code matters too. This concerns the handling of possible future errors, clear documentation, and the efficiency of the implementation (speed of the code).

That is why in practice the second phase usually starts with restructuring the code. The final model is structured and detached from all experiments that took place during the previous phase.
The production environment must meet a number of requirements:

• When the use of the model varies greatly over time, it is helpful if the production environment facilitates automatic scaling. This can be compared to a supermarket, where several cash registers are opened when it gets busier and closed again when things get quieter. Such a solution is also ready for the future: the model can be used more intensively without additional infrastructure investment.
• Another important condition for the production environment is the availability of the solution. The end-users/executive experts must always be able to use the model. Think of an automatic restart if systems fail. Automatic backups can also be part of this.
• A third requirement is authentication and security. The production environment, the model and the data must be accessible only to the people and/or systems that are allowed to view and use them. This requires a safe and reliable login system.
• Transparency and auditing in the production environment are also a point of attention. You want to know exactly who changed what and when. Changes must be traceable, so that it can be traced back to where things went wrong. Thanks to good logging you can quickly find out exactly what happened or went wrong, which makes troubleshooting easier and faster (see the sketch after this list). Link the auditing to the monitoring of the models, so that it can be determined whether a change in performance is related to an update of the model.
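As an illustration of such logging, here is a minimal Python sketch that records every prediction with a timestamp and model version; the model object and the version tag are hypothetical.

# Sketch of audit logging around a model in production: every prediction
# is logged with a timestamp and model version so incidents can be traced.
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
audit_log = logging.getLogger("model_audit")

MODEL_VERSION = "concrete-damage-2.1"  # illustrative version tag

def predict_with_audit(model, features):
    prediction = model.predict([features])[0]
    audit_log.info(json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_version": MODEL_VERSION,
        "features": list(features),
        "prediction": int(prediction),
    }))
    return prediction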
■ Make sure you have an understanding of the labelled data, so you are not dependent on one of the developers.
■ Establish specific quality norms for the development of AI models (supervised learning).
■ Intellectual Property (IP): since the added value of a model results from the adjustment of its parameters, explicit agreements must be made about the IP rights thereon.

The law firm Pels Rijcken, in cooperation with Amsterdam City Council, developed general terms and conditions for the purchase of AI applications, used specifically for applications that make decisions about citizens. Although purchasing conditions depend on the exact case, parts of these conditions might be reused.
Acceptance
Enthusiasm from the users for the new solution does not always come
naturally. This is often related to the idea that an algorithm poses a future
threat to the employee’s role in an organisation. Yet there is still a sharp
contrast between where the power of people lies and that of an algorithm.
Where computers and algorithms are good at monitoring the continuous
incoming flow of data, people are much better at understanding the context
in which the outcome of an algorithm must be placed. This is a reason to pursue, in the first instance, solutions that focus as much as possible on human-algorithm collaboration: the algorithm filters and enriches the total flow of data with discovered associations or properties, and the expert/technician spends time interpreting the results and making the ultimate decisions.
Liability
How the model is used in practice partly determines who is responsible for the decisions that are made. When the data science model is used as a filter, one has to think about what happens if the model makes a mistake in filtering: what are the effects, and who is held responsible? And what about an autonomous data science solution such as a self-driving car: who can be held accountable if the car causes an accident?
These are often tough ethical questions, but it is important to consider these
issues prior to integrating a model into a business process.
STEP 5 SUMMARIZED
Step 6: Managing models in operation

A model which runs and is used in a production environment must be checked frequently. It is necessary to agree on who is responsible for this and who executes these frequent checks; the data science team or the end-user/executive expert can take on this role.

It may be necessary to regularly 'retrain' a Machine Learning model with new data. This can be a manual task, but it can also be built into the end solution. In the latter case, monitoring the performance of the model is essential. Which model performance and code belong to which set of training data has to be stored in a structured way, so that changes in the performance of a model can be traced back to the data. This is called 'data lineage'. More and more tooling is coming onto the market for this; an example is Data Version Control (DVC).
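The idea behind data lineage can be illustrated with a small Python sketch that stores, per retraining run, a hash of the training data together with the model version and its performance. The file names here are hypothetical; tools such as DVC automate this bookkeeping.

# Sketch of recording lineage per retraining run: which data, which
# model version and which performance belong together.
import hashlib
import json
from datetime import datetime, timezone

def record_lineage(data_path, model_version, metrics, out="lineage.jsonl"):
    with open(data_path, "rb") as f:
        data_hash = hashlib.sha256(f.read()).hexdigest()
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,
        "training_data_sha256": data_hash,
        "metrics": metrics,
    }
    with open(out, "a") as f:
        f.write(json.dumps(entry) + "\n")

# e.g. record_lineage("train.csv", "v2.1", {"accuracy": 0.93})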
STEP 6 SUMMARIZED
• Clearly agree who will continue to monitor the model over time when the model
is running in the operational environment.
• When a model is frequently 'retrained' on new data, it is important to consistently record when retraining happened, which model version was used, and which performance belongs to which training dataset. This is required for the traceability of model performance.
Step 7: From model management to business case

The performance of a data science model must be checked frequently, because any model can degrade over time. This occurs due to the strong dependence between the code of the algorithm, the data used to train the algorithm, and the continuous flow of new data, which is influenced by various external factors. It may happen that environmental factors change, so that certain assumptions are no longer correct, or that new variables are measured which were previously unavailable.

The development of the algorithm therefore continues in the background. This creates newer model versions for the same business case, with software updates as a result. In practice, several model versions run in parallel for a while, so that the differences become transparent. Each model has its own dependencies that must be taken into account.

Ultimately, data science projects are a closed loop and improvements are always possible. The ever-changing code, variables and data make data science products technically complex on the one hand, but on the other hand very flexible and particularly powerful when properly applied.
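Running model versions in parallel can be as simple as letting the new version predict "in the shadow" of the current one, as the following Python sketch illustrates; all names are illustrative.

# Sketch of running two model versions in parallel ("shadow" mode): the
# current model serves the answer, the candidate is only logged, so the
# differences become transparent before switching over.
def predict_in_parallel(current_model, candidate_model, features, log):
    answer = current_model.predict([features])[0]    # used in production
    shadow = candidate_model.predict([features])[0]  # logged for comparison
    log.append({"current": int(answer), "candidate": int(shadow),
                "agree": bool(answer == shadow)})
    return answer

# After a while, the share of agreeing predictions in `log` indicates
# whether the candidate behaves like (or better than) the current model.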
STEP 7 SUMMARIZED
• If possible, run a new model version in parallel with the previous model version for a
while so that they can be compared.
• Create standards and make them requirements for new data science projects.
º Fixed location (all in the same environment).
º Consistent code structure (if developed internally).
º Uniform data structure.