
Random Tech Thoughts

When a failing test might be OK


MAY 21, 2023 / MAY 22, 2023 ~ BOB
Usually, a failing test is a problem. In this article I will cover three cases where this might not
always be true: performance tests, testing a data science model such as a classifier, and testing in
quantum computing. In all these cases, a definitive answer about passing or failing is given by a
set of tests, rather than by a single test.

Dividing jobs between tests

In automated tests, each test usually does a different job – why have more than one test deliver the
same piece of information? It's common for there to be more test code than production code, and
maintaining test code is a cost that can't be ignored. Redundancy in the test code makes that
maintenance cost bigger than it needs to be.

Because each test checks a distinct part of the system's behaviour, any single test failing is a sign
that at least one part of the system isn't behaving as expected.

However, there are times where there is important variability – either intrinsic to the problem
being solved, or unavoidable variability in the way the problem is solved. When it’s too hard to
predict this variability accurately or how it will affect the test outcome, one approach is to create a
set of related tests, and this set is in many ways treated as a single test. The idea is that, while it’s
too hard to predict the behaviour of the system via a single (component) test, it is still possible to
predict its behaviour in general, i.e. across the set of tests.

In this case there are usually two levels of specification:

1. What makes a component test pass or fail?


2. What makes the set of tests pass or fail?

I’ll go into some examples below, where I’ll describe what gives rise to the variability and how the
success criteria are defined for the tests/set of tests.

Note that flaky tests are a similar but different problem. By flaky tests I mean tests that sometimes
pass and sometimes fail, and so give unreliable results. This is often due to variability in the order
in which different parts of the production or test code are executed, and this variability trips the
tests up. Flaky tests are something that should be fixable so that individual tests reliably pass or

fail, but this might require changes to the production code as well as to the test code. The rest of
this article concerns times when variability can’t be dealt with such that individual tests are
reliable.

Performance tests

In this context, I’m using performance as a synonym for latency – how long will the system take to
respond to a request? One way to specify performance requirements is in terms of percentages.
For instance:

95% of requests must be processed in at most 0.1 seconds,


100% of requests must be processed in at most 0.5 seconds.

The performance requirements might be motivated by several things. One might be to ensure the
user gets a good user experience (https://randomtechthoughts.blog/category/user-experience/)
via a GUI. How quickly the GUI, and the systems behind it such as the API and database, respond to a
user request will influence the user's perception and enjoyment of the system. Alternatively, there
might not be any GUI or user involved directly, but the production system is an API that’s called
by other code. The performance requirements might be to ensure that many separate bits of code
can collaborate to create a bigger system, such as a phone network
(https://en.wikipedia.org/wiki/Signalling_System_No._7).

To test the system against its performance requirements, a suitable set of requests is created and
sent to the system under test. The meaning of suitable depends on context, but it is likely to follow
the pattern of requests that has been observed in production already. For instance, for an online
banking system one part of the pattern could be:

1. Log into the system


2. Look at the current balance on an account
3. Transfer money from that account to another account e.g. to pay a bill

In the light of the requirements above, if a request takes 0.7 seconds during a test – is that OK?
Definitely not. If a request takes 0.09 seconds, that's fine. If a request takes 0.3 seconds, then
things might be OK. If more than 5% of all requests fall into the 0.1 – 0.5 second range, then the set
of tests as a whole fails.
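
As a minimal sketch of how this set-level criterion could be automated (the thresholds come from the requirements above; the example latency lists are made up), a test harness might collect the measured latencies and check the percentile requirements in one place:

```python
# Minimal sketch: checking set-level latency requirements (thresholds from the text above).
def latency_requirements_met(latencies_seconds):
    """Return True if 95% of requests took at most 0.1s and 100% took at most 0.5s."""
    total = len(latencies_seconds)
    if total == 0:
        return False  # no data, so we can't claim the requirement is met

    within_fast_threshold = sum(1 for t in latencies_seconds if t <= 0.1)
    within_hard_threshold = sum(1 for t in latencies_seconds if t <= 0.5)

    return (within_fast_threshold / total >= 0.95) and (within_hard_threshold == total)


# A single 0.3s request is fine on its own...
print(latency_requirements_met([0.05] * 97 + [0.3] * 3))   # True
# ...but too many requests in the 0.1 - 0.5 second range fail the set as a whole.
print(latency_requirements_met([0.05] * 90 + [0.3] * 10))  # False
```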

Why can’t performance be specified more tightly? There are two sources of variation – the
mixture of different kinds of request, and real-world limitations of the implementation (which I
will explain shortly). The different kinds of request (e.g. the banking related ones above) all need
to be completed before the user thinks things are too slow, even if they need different amounts of
work to complete.

The other source of variation is the limitations of the implementation. There are many helpful lies
that one part of the system tells to other parts. The lies are essentially saying that things are
simpler or better than they actually are. Enough of the time the pretence holds, but occasionally
the lie shows through and this affects performance.

The kinds of thing that I mean by helpful lie are:

- The database / CPU / etc. is faster than it actually is – a lie involving some kind of cache;
- The CPU, memory, network, and other important resources that some process needs are used
  only by that process and nothing else – a lie involving virtualisation and other ways to share
  things.

In a cache – whether this is an instruction cache on a CPU or a cache of queries in a database –
there is an operation that's relatively slow and expensive. Operations seen recently have their
results stored in a way that makes it quicker to look up the results than to do the operation from
scratch. However, the cache isn’t infinitely big, so old or infrequently used entries in the cache are
evicted to make way for others once the cache gets full. That means that the next time an evicted
operation is attempted, it will have its full cost and not the reduced cost from using the cache.
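
As an illustrative sketch (my own example, not from the article), Python's functools.lru_cache makes the eviction effect easy to see: with a deliberately tiny cache, the least recently used entry is evicted once the cache fills up, and the next call with that argument pays the full cost again:

```python
import functools
import time

@functools.lru_cache(maxsize=2)  # deliberately tiny cache so eviction is easy to trigger
def slow_lookup(key):
    time.sleep(0.1)  # stand-in for an expensive operation, e.g. a database query
    return key.upper()

slow_lookup("a")  # miss: pays the full 0.1s cost
slow_lookup("b")  # miss
slow_lookup("a")  # hit: answered from the cache, effectively free
slow_lookup("c")  # miss: the cache is full, so "b" (least recently used) is evicted
slow_lookup("b")  # miss again: the earlier saving has been lost
print(slow_lookup.cache_info())  # CacheInfo(hits=1, misses=4, maxsize=2, currsize=2)
```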

Very early computers exposed programmers to all the details of their hardware, which made having
two or more programs running on a computer at once tricky. How would they share the CPU,
memory, disk space etc.? More modern computers take that burden off the programmer and
handle it in things like the operating system.

The operating system creates illusions such as virtual memory – a contiguous chunk of memory
that is solely for one program, even though behind the scenes this is made of several separate
chunks of physical memory, and many different programs are using the physical memory at once.
Similarly, each program thinks it’s running on a CPU dedicated to running just that program. In
reality, each program gets a series of slices of CPU time, to allow the CPU to be shared across
many programs.

Enough of the time, the difference between appearance and reality is fine. However, it can
sometimes cause delays in the execution of code. For example, a program might be blocked
waiting for a long database query, so the operating system decides to divert some of its
physical memory to another program that isn't currently blocked on anything. Before this happens,
the contents of that memory are written to disk. A little while later, when the
database query finishes, the program is ready to run again and so needs that bit of memory back.
There will therefore be a delay while the data is read from disk back into memory.

These costs will happen at hard to predict times, as they are based on the interactions of many
moving parts at many levels of abstraction. Therefore, it’s easier to describe the system’s latency
in general terms such as percentages and ranges.

Data science

In a previous article I described fuzzy matching
(https://randomtechthoughts.blog/2023/04/12/fuzzy-matching-introduction/). This can be
used as part of classification – seeing what class or category a given thing fits into. For instance, a
system could be trained to say if a photo contains: a cat, a dog, a volcano, or something else.

In this situation, the variability is intrinsic to the problem. Not all cats look the same, and a given
cat will look different from different angles, in different lights, or in different poses. It is usually
unrealistic to expect a classifier to get 100% accurate results all the time. Sometimes it will come
up with the wrong answer (poor accuracy) or won’t be able to give any answer (poor coverage).

One way to represent the behaviour of a classifier is with a confusion matrix, such as the one
below:

Expected \ Actual | Cat | Dog | Volcano | Other | Don't know
Cat               |  24 |   2 |       2 |       |
Dog               |     |  38 |         |       |          1
Volcano           |     |     |      15 |       |
Other             |   6 |     |         |    11 |          1

The numbers show a percentage of all images (tests) in the set of tests. For instance, 24% of images
are cats that are correctly classified as a cat, 2% of images are cats that are (incorrectly) classified
as a volcano, etc. Cells containing zero are left blank for clarity. The diagonal from top left to
bottom right is where the system is behaving as intended – the actual result matches the expected
result. Everything else is some kind of error.

The right-hand column shows how much of a coverage problem the classifier has, i.e. how many
times it has failed to come up with any answer. The cells that are neither italic nor in the right-
hand column show how much of an accuracy problem the classifier has, i.e. how many times it has
come up with the wrong answer.
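
Here is a rough sketch of how coverage and accuracy could be computed from a matrix like the one above (the dictionary cells mirror my reading of the table, and counting accuracy over answered images only is one possible choice among several):

```python
# Sketch: deriving coverage and accuracy from a confusion matrix.
# The dictionary maps (expected, actual) -> percentage of all test images;
# the cell values mirror the table above and are illustrative only.
confusion = {
    ("cat", "cat"): 24, ("cat", "dog"): 2, ("cat", "volcano"): 2,
    ("dog", "dog"): 38, ("dog", "don't know"): 1,
    ("volcano", "volcano"): 15,
    ("other", "cat"): 6, ("other", "other"): 11, ("other", "don't know"): 1,
}

total = sum(confusion.values())                          # 100
answered = sum(v for (_, actual), v in confusion.items()
               if actual != "don't know")                # images that got *some* answer
correct = sum(v for (expected, actual), v in confusion.items()
              if expected == actual)                     # the diagonal

coverage = answered / total    # how often the classifier gave any answer
accuracy = correct / answered  # how often an answer, once given, was right

print(f"coverage = {coverage:.0%}, accuracy = {accuracy:.0%}")  # coverage = 98%, accuracy = 90%
```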

If there are only two categories (and no don’t know column) then a confusion matrix can be thought
of as another way of representing the information in a table showing false positives, false
negatives, true positives and true negatives.

How to define criteria for when the test set passes or fails will depend on the context. Given that
it’s likely to be impossible to have both 100% coverage and 100% accuracy, is it better to have high
coverage or high accuracy? Within accuracy, are some categories, or some kinds of mis-
categorisation, more important than others? For instance, given that cats are more similar in how
they fit into human society than they are to volcanoes, if a cat is mis-categorised as a dog is that a
bigger or smaller problem than if it’s mis-categorised as a volcano?
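
One way to encode such context-specific priorities (purely illustrative, with made-up weights and cells) is a cost matrix that penalises some mis-categorisations more heavily than others; the set of tests then passes only if the total weighted cost stays within an agreed budget:

```python
# Sketch: weighting mis-categorisations differently (weights and budget are made up).
# A cat classified as a dog is treated as a smaller problem than a cat classified
# as a volcano; "don't know" sits somewhere in between.
error_costs = {
    ("cat", "dog"): 1.0,
    ("cat", "volcano"): 3.0,
    ("cat", "don't know"): 2.0,
    # ...and so on for the other expected categories
}

def total_error_cost(errors, costs, default_cost=1.0):
    """Sum the weighted cost of the given mis-categorisation cells (percent of all images)."""
    return sum(costs.get((expected, actual), default_cost) * percent
               for (expected, actual), percent in errors.items())

# Off-diagonal cells only, mirroring the table above (illustrative values):
errors = {("cat", "dog"): 2, ("cat", "volcano"): 2, ("dog", "don't know"): 1,
          ("other", "cat"): 6, ("other", "don't know"): 1}

# The set of tests passes only if the weighted error cost is within the budget.
print(total_error_cost(errors, error_costs) <= 20.0)  # True (the total cost here is 16.0)
```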

Quantum

If you’re unfamiliar with quantum computing, please refer to my article introducing it
(https://randomtechthoughts.blog/2020/09/01/a-beginners-guide-to-practical-quantum-computing/).
A summary is as follows. There are special variables called qubits. A qubit's life has two phases:
before measurement and after measurement. Before measurement is equivalent to the
box that contains Schrödinger’s cat before the box is opened. Instead of being a box containing a
cat that is both dead and alive, a qubit before measurement contains both 0 and 1. After
measurement, a qubit contains only 0 or 1, like a bit in conventional programming. This is
equivalent to Schrödinger’s cat’s box once it has been opened – the cat is definitely dead or alive /
the qubit is definitely 0 or 1.

Image credit (https://www.flickr.com/photos/31690139@N02/2965956885)
under Creative Commons Attribution 2.0 Generic (https://creativecommons.org/licenses/by/2.0/)

The important part is that before measurement, a qubit has a probability that it will deliver the
value 0 when it’s measured (and 100% minus that probability that it will deliver 1 when it’s
measured). The goal of quantum code is to massage the probability on its qubits such that
probability is moved towards the correct answer[s] and away from the incorrect answer[s]. (These
operations that move probability aren’t the same as measurement, so a qubit is still in a
superposition of states, it’s just that one state becomes more likely than the other one.)

There’s a class of problems, called BQP (https://en.wikipedia.org/wiki/BQP), where the best
approach has a less than 100% probability of delivering the correct answer. (The probability is at
least 2/3.) This means that bug-free code running on valid inputs will deliver the wrong answer
some of the time (and the correct answer the rest of the time, on the same inputs).
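
A small worked example (my own illustration, not from the article): if a single run is correct with probability 2/3, then repeating the run an odd number of times and taking the majority answer makes the aggregate answer correct with ever higher probability:

```python
from math import comb

def majority_correct_probability(p_single=2/3, runs=1):
    """Probability that the majority vote over `runs` independent runs is correct."""
    needed = runs // 2 + 1  # more than half the runs must individually be correct
    return sum(comb(runs, k) * p_single**k * (1 - p_single)**(runs - k)
               for k in range(needed, runs + 1))

for runs in (1, 5, 11):
    print(runs, round(majority_correct_probability(runs=runs), 4))
# 1 0.6667
# 5 0.7901
# 11 0.8779
```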

If you run some code and get the wrong answer, is that evidence that your code has a bug? It
depends on how often it delivers the wrong answer, and how this compares to the expected
frequency of a wrong answer. At this point you might realise that this is a physical (quantum)
system that happens to include code, so you start reaching for tools usually used for analysing
data in science, such as hypothesis testing and p-values
(https://www.youtube.com/watch?v=vemZtEM63GY). For instance, you create a hypothesis that
the code delivers correct results 87% of the time, and you want to be at least 95% sure of this (you
use p=0.05). This then directs how you will test your code.
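
Continuing with these figures, a minimal sketch of such a check could use a binomial test (scipy's binomtest here; the observed counts of 240 correct answers in 300 runs are made up for illustration):

```python
from scipy.stats import binomtest

# Hypothesis: the code delivers the correct answer 87% of the time.
# Observed (hypothetical): 240 correct answers out of 300 runs.
result = binomtest(k=240, n=300, p=0.87, alternative="less")

# With 0.05 as the significance threshold, a small p-value suggests the code is
# correct less often than claimed, i.e. it is worth investigating for a bug.
if result.pvalue < 0.05:
    print("Observed success rate is suspiciously low - investigate the code.")
else:
    print("Observed results are consistent with the 87% hypothesis.")
```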

The source of variability in this case is quantum physics! It is probabilistic, and so code that uses it
(quantum code) will also be probabilistic.

Summary

Much of the time, a single failing test can reliably tell you something useful about some code:
that the code isn't behaving as expected. However, there are cases where there is too much
variation in either the problem or the implementation of its solution for this to be possible.
These range from conventional cases, such as performance tests, to less conventional cases, such as
quantum.

If a single failing test is unreliable, it’s worth looking at grouping together a set of related tests, in
case the system’s behaviour is predictable over enough cases, i.e. in aggregate.
