When A Failing Test Might Be OK - Random Tech Thoughts
In automated tests, each test does a different job – why bother having more than one test to deliver the same information? It’s common for there to be more test code than production code, and maintaining test code is a cost that can’t be ignored. Redundancy in the test code makes this maintenance cost bigger than it needs to be.
This means that any test failing is a sign that at least one part of the system isn’t behaving as
expected.
However, there are times where there is important variability – either intrinsic to the problem
being solved, or unavoidable variability in the way the problem is solved. When it’s too hard to
predict this variability accurately or how it will affect the test outcome, one approach is to create a
set of related tests, and this set is in many ways treated as a single test. The idea is that, while it’s
too hard to predict the behaviour of the system via a single (component) test, it is still possible to
predict its behaviour in general, i.e. across the set of tests.
I’ll go into some examples below, where I’ll describe what gives rise to the variability and how the
success criteria are defined for the tests/set of tests.
Note that flaky tests are a similar but different problem. By flaky tests I mean tests that sometimes
pass and sometimes fail, and so give unreliable results. This is often due to variability in the order
in which different parts of the production or test code are executed, and this variability trips the
tests up. Flaky tests are something that should be fixable so that individual tests reliably pass or
fail, but this might require changes to the production code as well as to the test code. The rest of
this article concerns times when variability can’t be dealt with such that individual tests are
reliable.
Performance tests
In this context, I’m using performance as a synonym for latency – how long will the system take to
respond to a request? One way to specify performance requirements is in terms of percentages. For instance: no request may take longer than 0.5 seconds, and most requests (an agreed percentage of them) must complete within 0.1 seconds.
The performance requirements might be motivated by several things. One might be to ensure the
user gets a good user experience (https://randomtechthoughts.blog/category/user-experience/)
via a GUI. How quickly the GUI, and the systems behind it such as APIs and databases, respond to a user request will influence the user’s perception and enjoyment of the system. Alternatively, there
might not be any GUI or user involved directly, but the production system is an API that’s called
by other code. The performance requirements might be to ensure that many separate bits of code
can collaborate to create a bigger system, such as a phone network
(https://en.wikipedia.org/wiki/Signalling_System_No._7).
To test the system against its performance requirements, a suitable set of requests is created and
sent to the system under test. The meaning of suitable depends on context, but it is likely to follow
the pattern of requests that has been observed in production already. For instance, for an online banking system one part of the pattern could be a given mix of request types, such as checking a balance, viewing recent transactions, and making a payment.
In the light of the requirements above, if a request takes 0.7 seconds during a test – is that OK?
Definitely not. If a request takes 0.09 seconds, that’s fine. If a request takes 0.3 seconds, then
things might be OK. If too many other requests are also in the 0.1 – 0.5 second range, then the set
of tests as a whole fails.
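As a rough sketch of what that aggregate judgement could look like in a test harness (the 0.5 second hard limit, 0.1 second target and 90% threshold below are illustrative assumptions, not requirements from any real system):

```python
# Sketch: judge a performance test run as a whole, rather than request by request.
# All thresholds are illustrative assumptions.

HARD_LIMIT_SECONDS = 0.5   # no single request may take longer than this
TARGET_SECONDS = 0.1       # most requests should finish within this
TARGET_FRACTION = 0.9      # "most" = at least 90% of requests

def performance_run_passes(latencies_seconds: list[float]) -> bool:
    """Return True if the whole set of measured latencies meets the criteria."""
    if not latencies_seconds:
        return False  # no measurements, no verdict
    if max(latencies_seconds) > HARD_LIMIT_SECONDS:
        return False  # a single 0.7 second request fails the run outright
    fast_enough = sum(1 for t in latencies_seconds if t <= TARGET_SECONDS)
    return fast_enough / len(latencies_seconds) >= TARGET_FRACTION

# A 0.3 second request is tolerable on its own, but only while it stays rare.
print(performance_run_passes([0.09, 0.08, 0.3, 0.07, 0.06, 0.09, 0.05, 0.08, 0.07, 0.09]))  # True
print(performance_run_passes([0.09, 0.3, 0.3, 0.3, 0.06]))                                  # False
```

The point is that no single measurement decides the outcome; the verdict comes from the whole run.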
Why can’t performance be specified more tightly? There are two sources of variation – the
mixture of different kinds of request, and real-world limitations of the implementation (which I
will explain shortly). The different kinds of request (e.g. the banking related ones above) all need
to be completed before the user thinks things are too slow, even if they need different amounts of
work to complete.
The other source of variation is the limitations of the implementation. There are many helpful lies
that one part of the system tells to other parts. The lies are essentially saying that things are
simpler or better than they actually are. Enough of the time the pretence holds, but occasionally the lie shows through and this affects performance. For example:
- the database / CPU / etc. is faster than it actually is – a lie involving some kind of cache;
- the CPU, memory, network, and other important resources that some process needs are used only by that process and nothing else – a lie involving virtualisation and other ways to share things.
Very early computers exposed programmers to all the details of their hardware, which made having two or more programs running on a computer at once tricky. How would they share the CPU, memory, disk space and so on? More modern computers take that burden off the programmer and handle it in things like the operating system.
The operating system creates illusions such as virtual memory – a contiguous chunk of memory
that is solely for one program, even though behind the scenes this is made of several separate
chunks of physical memory, and many different programs are using the physical memory at once.
Similarly, each program thinks it’s running on a CPU dedicated to running just that program. In
reality, each program gets a series of slices of CPU time, to allow the CPU to be shared across
many programs.
Enough of the time, the difference between appearance and reality is fine. However, it can
sometimes cause delays in the execution of code. For example, imagine a program that is blocked waiting for a long database query, so the operating system decides to divert some of its physical memory to another program that isn’t currently blocked on anything. Before this happens, the contents of that memory are written to disk. A little while later, when the database query finishes, the program is ready to run again and so needs that bit of memory back. There will therefore be a delay while the data is read from disk back into memory.
These costs will happen at hard to predict times, as they are based on the interactions of many
moving parts at many levels of abstraction. Therefore, it’s easier to describe the system’s latency
in general terms such as percentages and ranges.
Data science
A typical example is a classifier that takes an image and says whether it shows a cat, a dog, a volcano, or something else. In this situation, the variability is intrinsic to the problem. Not all cats look the same, and a given cat will look different from different angles, in different lights, or in different poses. It is usually unrealistic to expect a classifier to get 100% accurate results all the time. Sometimes it will come up with the wrong answer (poor accuracy) or won’t be able to give any answer (poor coverage).
One way to represent the behaviour of a classifier is with a confusion matrix, such as the one
below:
                  Actual
                  Cat       Dog       Volcano   Other     Don’t know
Expected Cat      24                  2                   2
         Dog                38                            1
         Volcano                      15
         Other    6                             11        1
The numbers show a percentage of all images (tests) in the set of tests. For instance, 24% of images
are cats that are correctly classified as a cat. 2% of images are cats that are (incorrectly) classified
as a volcano, etc. Cells containing zero are left blank for clarity. The diagonal (top left to bottom right) is where the system is behaving as intended – the actual result matches the expected result.
Everything else is some kind of error.
The right-hand column shows how much of a coverage problem the classifier has, i.e. how many
times it has failed to come up with any answer. The cells that are neither on the diagonal nor in the right-hand column show how much of an accuracy problem the classifier has, i.e. how many times it has
come up with the wrong answer.
If there are only two categories (and no don’t know column) then a confusion matrix can be thought
of as another way of representing the information in a table showing false positives, false
negatives, true positives and true negatives.
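As a sketch of how such a matrix could be built from a labelled set of test images (the function below is hypothetical, not from any particular library; the category names follow the example above):

```python
from collections import Counter

def confusion_matrix(expected: list[str], actual: list[str]) -> dict[tuple[str, str], float]:
    """Map each (expected, actual) pair of categories to its percentage of all test images."""
    counts = Counter(zip(expected, actual))
    total = len(expected)
    return {pair: 100 * count / total for pair, count in counts.items()}

# Four test images: three classified correctly, one cat mistaken for a volcano.
expected = ["Cat", "Cat", "Dog", "Volcano"]
actual = ["Cat", "Volcano", "Dog", "Volcano"]
print(confusion_matrix(expected, actual))
# {('Cat', 'Cat'): 25.0, ('Cat', 'Volcano'): 25.0, ('Dog', 'Dog'): 25.0, ('Volcano', 'Volcano'): 25.0}
```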
How to define criteria for when the test set passes or fails will depend on the context. Given that
it’s likely to be impossible to have both 100% coverage and 100% accuracy, is it better to have high
coverage or high accuracy? Within accuracy, are some categories, or some kinds of mis-
categorisation, more important than others? For instance, given that cats are more similar to dogs in how they fit into human society than they are to volcanoes, is a cat mis-categorised as a dog a bigger or smaller problem than a cat mis-categorised as a volcano?
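Continuing that sketch, the pass/fail decision for the whole test set might then weigh coverage and accuracy separately (the budgets below – 5% for “don’t know” answers and 10% for wrong answers – are again illustrative assumptions):

```python
def classifier_test_set_passes(matrix: dict[tuple[str, str], float],
                               max_dont_know_pct: float = 5.0,
                               max_wrong_pct: float = 10.0) -> bool:
    """Pass the whole test set only if coverage and accuracy problems stay within budget.

    matrix maps (expected, actual) category pairs to percentages of all test images,
    as produced by confusion_matrix() above.
    """
    dont_know = sum(pct for (exp, act), pct in matrix.items() if act == "Don't know")
    wrong = sum(pct for (exp, act), pct in matrix.items()
                if act != exp and act != "Don't know")
    return dont_know <= max_dont_know_pct and wrong <= max_wrong_pct
```

A more refined version could weight particular mis-categorisations, such as cat-as-volcano, more heavily than others.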
Quantum
[Image credit (https://www.flickr.com/photos/31690139@N02/2965956885), under Creative Commons Attribution 2.0 Generic (https://creativecommons.org/licenses/by/2.0/)]
Quantum code runs on qubits which, unlike classical bits, can be in a superposition of the states 0 and 1. The important part is that, before measurement, a qubit has a probability that it will deliver the value 0 when it’s measured (and 100% minus that probability that it will deliver 1 when it’s measured). The goal of quantum code is to massage the probabilities of its qubits such that probability is moved towards the correct answer[s] and away from the incorrect answer[s]. (These operations that move probability aren’t the same as measurement, so a qubit is still in a superposition of states, it’s just that one state becomes more likely than the other.)
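In the standard notation (not specific to this article), a qubit’s state and its measurement probabilities can be written as

\[
\lvert\psi\rangle = \alpha\,\lvert 0\rangle + \beta\,\lvert 1\rangle,
\qquad
P(0) = \lvert\alpha\rvert^{2},
\qquad
P(1) = \lvert\beta\rvert^{2} = 1 - \lvert\alpha\rvert^{2},
\]

so the massaging above means applying operations that change α and β without measuring the qubit.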
This means that even correct quantum code can deliver the wrong answer some of the time (and the correct answer the rest of the time, on the same inputs).
If you run some code and get the wrong answer, is that evidence that your code has a bug? It
depends on how often it delivers the wrong answer, and how this compares to the expected
frequency of a wrong answer. At this point you might start to treat this as a physical (quantum) system that happens to include code, and reach for tools usually used for analysing data in science, such as hypothesis testing and p-values (https://www.youtube.com/watch?v=vemZtEM63GY). For instance, you create a hypothesis that the code delivers correct results 87% of the time, and you choose a significance level of p = 0.05 for testing it. This then directs how you will test your code.
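A minimal sketch of that check, using the 87% success rate and p = 0.05 from the example above and SciPy’s one-sided binomial test (the run counts below are made up):

```python
from scipy.stats import binomtest

EXPECTED_SUCCESS_RATE = 0.87   # hypothesis: the code gives the correct answer 87% of the time
SIGNIFICANCE_LEVEL = 0.05      # p = 0.05, as in the example above

def looks_buggy(correct_runs: int, total_runs: int) -> bool:
    """Return True if the observed success rate is significantly below the expected 87%."""
    result = binomtest(correct_runs, total_runs, EXPECTED_SUCCESS_RATE, alternative="less")
    return result.pvalue < SIGNIFICANCE_LEVEL

print(looks_buggy(800, 1000))  # True: 80% correct is very unlikely if the code really succeeds 87% of the time
print(looks_buggy(860, 1000))  # False: 86% correct is within the range expected from random variation
```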
The source of variability in this case is quantum physics! It is probabilistic, and so code that uses it
(quantum code) will also be probabilistic.
Summary
Much of the time, a single failing test reliably tells you something useful about some code: that the code isn’t behaving as expected. However, there are cases where there is too
much variation in either the problem or the implementation of its solution for this to be possible.
These range from conventional cases, such as performance tests, to less conventional cases, such as
quantum.
If a single failing test is unreliable, it’s worth looking at grouping together a set of related tests, in
case the system’s behaviour is predictable over enough cases, i.e. in aggregate.