Is there an Elegant Universal

Theory of Prediction?

Shane Legg

Technical Report No. IDSIA-12-06


October 19, 2006

IDSIA / USI-SUPSI
Dalle Molle Institute for Artificial Intelligence
Galleria 2, 6928 Manno, Switzerland

IDSIA is a joint institute of both University of Lugano (USI) and University of Applied Sciences of Southern Switzerland
(SUPSI), and was founded in 1988 by the Dalle Molle Foundation which promoted quality of life.

Is there an Elegant Universal


Theory of Prediction? ∗
Shane Legg†
October 19, 2006

Abstract
Solomonoff’s inductive learning model is a powerful, universal and highly elegant theory
of sequence prediction. Its critical flaw is that it is incomputable and thus cannot be used in
practice. It is sometimes suggested that it may still be useful to help guide the development
of very general and powerful theories of prediction which are computable. In this paper it
is shown that although powerful algorithms exist, they are necessarily highly complex. This
alone makes their theoretical analysis problematic; however, it is further shown that beyond a
moderate level of complexity the analysis runs into the deeper problem of Gödel incompleteness.
This limits the power of mathematics to analyse and study prediction algorithms, and
indeed intelligent systems in general.

1 Introduction
Solomonoff’s model of induction rapidly learns to make optimal predictions for any computable
sequence, including probabilistic ones [13, 14]. It neatly brings together the philosophical
principles of Occam’s razor, Epicurus’ principle of multiple explanations, Bayes’ theorem and Turing’s
model of universal computation into a theoretical sequence predictor with astonishingly powerful
properties. Indeed, the problem of sequence prediction could well be considered solved [9, 8] if it
were not for the fact that Solomonoff’s theoretical model is incomputable.
Among computable theories there exist powerful general predictors, such as the Lempel-Ziv
algorithm [5] and Context Tree Weighting [18], that can learn to predict some complex sequences,
but not others. Some prediction methods, based on the Minimum Description Length principle [12]
or the Minimum Message Length principle [17], can even be viewed as computable approximations
of Solomonoff induction [10]. However, in practice their power and generality are limited by
the power of the compression methods employed, and their data efficiency is significantly
lower than that of Solomonoff induction [11].
Could there exist elegant computable prediction algorithms that are in some sense universal?
Unfortunately this is impossible, as pointed out by Dawid [4]. Specifically, he notes that for any
statistical forecasting system there exist sequences which are not calibrated. Dawid also notes that
a forecasting system for a family of distributions is necessarily more complex than any forecasting
system generated from a single distribution in the family. However, he does not deal with the
complexity of the sequences themselves, nor does he make a precise statement in terms of a
specific measure of complexity, such as Kolmogorov complexity. The impossibility of forecasting
has since been developed in considerably more depth by V’yugin [16]; in particular, he proves that
there is an efficient randomised procedure producing sequences that cannot be predicted (with
high probability) by computable forecasting systems.
∗ This work was supported by SNF grant 200020-107616.
† shane@[Link]
In this paper we study the prediction of computable sequences from the perspective of
Kolmogorov complexity. The central question we look at is the prediction of sequences which have
bounded Kolmogorov complexity. This leads us to a new notion of complexity: rather than the
length of the shortest program able to generate a given sequence, in other words Kolmogorov
complexity, we take the length of the shortest program able to learn to predict the sequence. This new
complexity measure has the same fundamental invariance property as Kolmogorov complexity,
and a number of strong relationships between the two measures are proven. However, in general
the two may diverge significantly. For example, although a long random string that indefinitely
repeats has a very high Kolmogorov complexity, this sequence also has a relatively simple structure
that even a simple predictor can learn to predict.
We then prove that some sequences, however, can only be predicted by very complex predictors.
This implies that very general prediction algorithms, in particular those that can learn to predict
all sequences up to a given Kolmogorov complexity, must themselves be complex. This puts an end
to our hope of there being an extremely general and yet relatively simple prediction algorithm. We
then use this fact to prove that although very powerful prediction algorithms exist, they cannot
be mathematically discovered due to Gödel incompleteness. Given how fundamental prediction is
to intelligence, this result implies that beyond a moderate level of complexity the development of
powerful artificial intelligence algorithms can only be an experimental science.

2 Preliminaries
An alphabet $\mathcal{A}$ is a finite set of 2 or more elements which are called symbols. In this paper we
will assume a binary alphabet $\mathbb{B} := \{0, 1\}$, though all the results can easily be generalised to
other alphabets. A string is a finite ordered $n$-tuple of symbols denoted $x := x_1 x_2 \ldots x_n$ where
$\forall i \in \{1, \ldots, n\} : x_i \in \mathbb{B}$, or more succinctly, $x \in \mathbb{B}^n$. The 0-tuple is denoted $\lambda$ and is called the
null string. The expression $\mathbb{B}^{\le n}$ has the obvious interpretation, and $\mathbb{B}^* := \bigcup_{n \in \mathbb{N}} \mathbb{B}^n$. The
length-lexicographical ordering is a total order on $\mathbb{B}^*$ defined as
$\lambda < 0 < 1 < 00 < 01 < 10 < 11 < 000 < 001 < \cdots$. A substring of $x$ is defined
$x_{j:k} := x_j x_{j+1} \ldots x_k$ where $1 \le j \le k \le n$. By $|x|$ we mean
the length of the string $x$, for example, $|x_{j:k}| = k - j + 1$. We will sometimes need to encode a
natural number as a string. Using simple encoding techniques it can be shown that there exists
a computable injective function $f : \mathbb{N} \to \mathbb{B}^*$ where no string in the range of $f$ is a prefix of any
other, and $\forall n \in \mathbb{N} : |f(n)| \le \log_2 n + 2 \log_2 \log_2 n + 1 = O(\log n)$.
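One familiar encoding with these properties is Elias delta coding. The sketch below is offered only as an illustration and is not claimed to be the encoding the report has in mind; the function name `elias_delta` is ours, and the code covers $n \ge 1$ (shifting the argument by one would cover 0). Its code length is $\lfloor \log_2 n \rfloor + 2\lfloor \log_2(\lfloor \log_2 n \rfloor + 1) \rfloor + 1$, which has the same form as the bound above and is $O(\log n)$.

```python
def elias_delta(n: int) -> str:
    """Return a prefix-free binary code word for a positive integer n.

    Illustrative assumption: this is Elias delta coding, used here as one
    example of a computable, injective, prefix-free encoding of integers.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    binary = bin(n)[2:]            # binary expansion of n; the leading bit is 1
    length = len(binary)           # equals floor(log2 n) + 1
    length_bits = bin(length)[2:]  # binary expansion of that length
    # Zeros announce how many bits the length field has, then comes the
    # length field itself, then n without its (implicit) leading 1.
    return "0" * (len(length_bits) - 1) + length_bits + binary[1:]


if __name__ == "__main__":
    for n in (1, 2, 3, 10, 1000):
        print(n, elias_delta(n))
```

Because a decoder can read the length field first, it always knows where a code word ends, which is what makes the code prefix-free.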
Unlike strings which always have finite length, a sequence $\omega$ is an infinite list of symbols
$x_1 x_2 x_3 \ldots \in \mathbb{B}^\infty$. Of particular interest to us will be the class of sequences which can be generated
by an algorithm executed on a universal Turing machine:

2.1 Definition. A monotone universal Turing machine $\mathcal{U}$ is defined as a universal Turing
machine with one unidirectional input tape, one unidirectional output tape, and some bidirectional
work tapes. Input tapes are read only, output tapes are write only, and unidirectional tapes are those
where the head can only move from left to right. All tapes are binary (no blank symbol) and the
work tapes are initially filled with zeros. We say that $\mathcal{U}$ outputs/computes a sequence $\omega$ on input
$p$, and write $\mathcal{U}(p) = \omega$, if $\mathcal{U}$ reads all of $p$ but no more as it continues to write $\omega$ to the output
tape.

We fix $\mathcal{U}$ and define $\mathcal{U}(p, x)$ by simply using a standard coding technique to encode a program
$p$ along with a string $x \in \mathbb{B}^*$ as a single input string for $\mathcal{U}$.

2.2 Definition. A sequence $\omega \in \mathbb{B}^\infty$ is a computable binary sequence if there exists a
program $q \in \mathbb{B}^*$ that writes $\omega$ to a one-way output tape when run on a monotone universal Turing
machine $\mathcal{U}$, that is, $\exists q \in \mathbb{B}^* : \mathcal{U}(q) = \omega$. We denote the set of all computable sequences by $\mathcal{C}$.

A similar definition for strings is not necessary as all strings have finite length and are therefore
trivially computable.

2.3 Definition. A computable binary predictor is a program $p \in \mathbb{B}^*$ that on a universal
Turing machine $\mathcal{U}$ computes a total function $\mathbb{B}^* \to \mathbb{B}$.

For simplicity of notation we will often write $p(x)$ to mean the function computed by the
program $p$ when executed on $\mathcal{U}$ along with the input string $x$, that is, $p(x)$ is shorthand for
$\mathcal{U}(p, x)$. Having $x_{1:n}$ as input, the objective of a predictor is for its output, called its prediction,
to match the next symbol in the sequence. Formally we express this by writing $p(x_{1:n}) = x_{n+1}$.
As the algorithmic prediction of incomputable sequences, such as the halting sequence, is
impossible by definition, we only consider the problem of predicting computable sequences. To
simplify things we will assume that the predictor has an unlimited supply of computation time
and storage. We will also make the assumption that the predictor has unlimited data to learn
from, that is, we are only concerned with whether or not a predictor can learn to predict in the
following sense:

2.4 Definition. We say that a predictor $p$ can learn to predict a sequence $\omega := x_1 x_2 \ldots \in \mathbb{B}^\infty$
if there exists $m \in \mathbb{N}$ such that $\forall n \ge m : p(x_{1:n}) = x_{n+1}$.

The existence of m in the above definition need not be constructive, that is, we might not
know when the predictor will stop making prediction errors for a given sequence, just that this
will occur eventually. This is essentially “next value” prediction as characterised by Barzdin [1],
which follows from Gold’s notion of identifiability in the limit for languages [7].

2.5 Definition. Let $P(\omega)$ be the set of all predictors able to learn to predict $\omega$. Similarly for
sets of sequences $S \subset \mathbb{B}^\infty$, define $P(S) := \bigcap_{\omega \in S} P(\omega)$.

A standard measure of complexity for sequences is the length of the shortest program which
generates the sequence:
2.6 Definition. For any sequence $\omega \in \mathbb{B}^\infty$ the monotone Kolmogorov complexity of the
sequence is,
$$K(\omega) := \min_{q \in \mathbb{B}^*} \{ |q| : \mathcal{U}(q) = \omega \},$$
where $\mathcal{U}$ is a monotone universal Turing machine. If no such $q$ exists, we define $K(\omega) := \infty$.



It can be shown that this measure of complexity depends on our choice of universal Turing
machine U, but only up to an additive constant that is independent of ω. This is due to the fact
that a universal Turing machine can simulate any other universal Turing machine with a fixed
length program.
In essentially the same way as the definition above we can define the Kolmogorov complexity
of a string $x \in \mathbb{B}^n$, written $K(x)$, by requiring that $\mathcal{U}(q)$ halts after generating $x$ on the output
tape. For an extensive treatment of Kolmogorov complexity and some of its applications see [10]
or [2].
As many of our results will have the above property of holding within an additive constant
that is independent of the variables in the expression, we will indicate this by placing a small plus
above the equality or inequality symbol. For example, $f(x) \overset{+}{<} g(x)$ means that
$\exists c \in \mathbb{R}, \forall x : f(x) < g(x) + c$. When using standard “Big O” notation this is unnecessary as expressions are
already understood to hold within an independent constant; however, for consistency of notation
we will use it in these cases also.

3 Prediction of computable sequences


The most elementary result is that every computable sequence can be predicted by at least one
predictor, and that this predictor need not be significantly more complex than the sequence to be
predicted.
3.1 Lemma. $\forall \omega \in \mathcal{C}, \exists p \in P(\omega) : K(p) \overset{+}{<} K(\omega)$.

Proof. As the sequence $\omega$ is computable, there must exist at least one algorithm that generates
$\omega$. Let $q$ be the shortest such algorithm and construct an algorithm $p$ that “predicts” $\omega$ as follows:
Firstly the algorithm $p$ reads $x_{1:n}$ to find the value of $n$, then it runs $q$ to generate $x_{1:n+1}$ and
returns $x_{n+1}$ as its prediction. Clearly $p$ perfectly predicts $\omega$ and $|p| < |q| + c$, for some small
constant $c$ that is independent of $\omega$ and $q$. □
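The construction in this proof can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a Python generator function stands in for the program $q$, and the names `predictor_from_generator` and `alternating` are hypothetical.

```python
from itertools import islice
from typing import Callable, Iterator, Sequence


def predictor_from_generator(
    generate: Callable[[], Iterator[int]]
) -> Callable[[Sequence[int]], int]:
    """Wrap a sequence generator (standing in for q) as a predictor p."""

    def predict(x: Sequence[int]) -> int:
        n = len(x)                                # only the length of the input is used
        prefix = list(islice(generate(), n + 1))  # run q until n + 1 symbols exist
        return prefix[n]                          # the (n + 1)th symbol is the prediction

    return predict


def alternating() -> Iterator[int]:
    """Example generator for the sequence 010101..."""
    bit = 0
    while True:
        yield bit
        bit = 1 - bit


if __name__ == "__main__":
    p = predictor_from_generator(alternating)
    print([p([0, 1] * k) for k in range(3)])
```

The wrapper adds only a fixed amount of code around the generator, which mirrors the constant $c$ in the proof.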

Not only can any computable sequence be predicted, there also exist very simple predictors
able to predict arbitrarily complex sequences:

3.2 Lemma. There exists a predictor $p$ such that $\forall n \in \mathbb{N}, \exists \omega \in \mathcal{C} : p \in P(\omega)$ and $K(\omega) > n$.

Proof. Take a string $x$ such that $K(x) = |x| \ge 2n$, and from this define a sequence $\omega := x0000\ldots$.
Clearly $K(\omega) > n$ and yet a simple predictor $p$ that always predicts 0 can learn to predict $\omega$. □

The predictor used in the above proof is very simple and can only “learn” sequences that end
with all 0’s, albeit where the initial string can have arbitrarily high Kolmogorov complexity. It
may seem that this is due to sequences that are initially complex but where the “tail complexity”,
defined $\liminf_{i \to \infty} K(\omega_{i:\infty})$, is zero. This is not the case:

3.3 Lemma. There exists a predictor $p$ such that $\forall n \in \mathbb{N}, \exists \omega \in \mathcal{C} : p \in P(\omega)$ and
$\liminf_{i \to \infty} K(\omega_{i:\infty}) > n$.

Proof. A predictor $p$ for eventually periodic sequences can be defined as follows: On input
$\omega_{1:k}$ the predictor goes through the ordered pairs $(1, 1), (1, 2), (2, 1), (1, 3), (2, 2), (3, 1), (1, 4), \ldots$
checking for each pair $(a, b)$ whether the string $\omega_{1:k}$ consists of an initial string of length $a$ followed
by a repeating string of length $b$. On the first match that is found $p$ predicts that the repeating
string continues, and then $p$ halts. If $a + b > k$ before a match is found, then $p$ outputs a fixed
symbol and halts. Clearly $K(p)$ is a small constant and $p$ will learn to predict any sequence that
is eventually periodic.
For any $(m, n) \in \mathbb{N}^2$, let $\omega := x(y^*)$ where $x \in \mathbb{B}^m$, and $y \in \mathbb{B}^n$ is a random string, that
is, $K(y) = n$. As $\omega$ is eventually periodic, $p \in P(\omega)$ and also we see that
$\liminf_{i \to \infty} K(\omega_{i:\infty}) = \min\{K(\omega_{m+1:\infty}), K(\omega_{m+2:\infty}), \ldots, K(\omega_{m+n:\infty})\}$.
For any $k \in \{1, \ldots, n\}$ let $q_k^*$ be the shortest program that can generate $\omega_{m+k:\infty}$. We can define
a halting program $q_k'$ that outputs $y$ where this program consists of $q_k^*$, $n$ and $k$. Thus,
$|q_k'| = |q_k^*| + O(\log n) = K(\omega_{m+k:\infty}) + O(\log n)$. As $n = K(y) \le |q_k'|$, we see that
$K(\omega_{m+k:\infty}) > n - O(\log n)$. As $n$ and $k$ are arbitrary the result follows. □
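The predictor for eventually periodic sequences described in this proof can be sketched as follows. This is a minimal illustration, assuming sequences are represented as Python lists of 0/1 integers; the function name `periodic_predictor` and the `default` parameter (the "fixed symbol") are our own.

```python
from typing import Sequence


def periodic_predictor(x: Sequence[int], default: int = 0) -> int:
    """Predict the next symbol of x, assuming x is eventually periodic.

    Pairs (a, b) are tried in the order (1,1), (1,2), (2,1), (1,3), ...,
    that is, by increasing a + b and then by increasing a.
    """
    k = len(x)
    for total in range(2, k + 1):      # total = a + b, never larger than k
        for a in range(1, total):
            b = total - a
            # After the first a symbols, every symbol must equal the one
            # b positions later for (a, b) to describe the observed string.
            if all(x[i] == x[i + b] for i in range(a, k - b)):
                return x[k - b]        # the repeating block continues
    return default                     # no pair with a + b <= k matched


if __name__ == "__main__":
    head, block = [1, 0, 0], [1, 1, 0, 1, 0]
    omega = head + block * 12
    mistakes = [n for n in range(len(omega))
                if periodic_predictor(omega[:n]) != omega[n]]
    print("positions of prediction errors:", mistakes)
```

For short inputs the match test can succeed vacuously and give wrong predictions; what matters for Definition 2.4 is only that the errors stop once enough of the sequence has been seen.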

Using a more sophisticated version of this proof it can be shown that there exist predictors
that can learn to predict arbitrary regular or primitive recursive sequences. Thus we might wonder
whether there exists a computable predictor able to learn to predict all computable sequences.
Unfortunately, no universal predictor exists; indeed, for every predictor there exists a sequence
which it cannot predict at all:
3.4 Lemma. For any predictor $p$ there constructively exists a sequence $\omega := x_1 x_2 \ldots \in \mathcal{C}$ such
that $\forall n \in \mathbb{N} : p(x_{1:n}) \ne x_{n+1}$ and $K(\omega) \overset{+}{<} K(p)$.
Proof. For any computable predictor $p$ there constructively exists a computable sequence
$\omega = x_1 x_2 x_3 \ldots$ computed by an algorithm $q$ defined as follows: Set $x_1 = 1 - p(\lambda)$, then $x_2 = 1 - p(x_1)$,
then $x_3 = 1 - p(x_{1:2})$ and so on. Clearly $\omega \in \mathcal{C}$ and $\forall n \in \mathbb{N} : p(x_{1:n}) = 1 - x_{n+1}$.
Let $p^*$ be the shortest program that computes the same function as $p$ and define a sequence
generation algorithm $q^*$ based on $p^*$ using the procedure above. By construction, $|q^*| = |p^*| + c$
for some constant $c$ that is independent of $p^*$. Because $q^*$ generates $\omega$, it follows that $K(\omega) \le |q^*|$.
By definition $K(p) = |p^*|$ and so $K(\omega) \overset{+}{<} K(p)$. □
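The diagonal construction in this proof is short enough to write out directly. The sketch below is illustrative only: predictors are modelled as Python functions from a list of bits to a bit, and the names `adversarial_prefix` and `majority` are hypothetical.

```python
from typing import Callable, List, Sequence

Predictor = Callable[[Sequence[int]], int]


def adversarial_prefix(p: Predictor, length: int) -> List[int]:
    """Return the first `length` symbols of a sequence that p never predicts."""
    x: List[int] = []
    for _ in range(length):
        x.append(1 - p(x))  # emit the opposite of whatever p predicts next
    return x


def majority(x: Sequence[int]) -> int:
    """Example predictor: predict 1 if ones are in the strict majority so far."""
    return int(2 * sum(x) > len(x))


if __name__ == "__main__":
    omega = adversarial_prefix(majority, 20)
    print(omega)
```

The generator needs nothing beyond the predictor itself and a constant amount of wrapper code, which is the content of the bound $K(\omega) \overset{+}{<} K(p)$.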

Allowing the predictor to be probabilistic does not fundamentally avoid the problem of
Lemma 3.4. In each step, rather than generating the opposite to what will be predicted by $p$,
instead $q$ attempts to generate the symbol which $p$ is least likely to predict given $x_{1:n}$. To do this
$q$ must simulate $p$ in order to estimate $\Pr\big(p(x_{1:n}) = 1 \mid x_{1:n}\big)$. With sufficient simulation effort, $q$
can estimate this probability to any desired accuracy for any $x_{1:n}$. This produces a computable
sequence $\omega$ such that $\forall n \in \mathbb{N} : \Pr\big(p(x_{1:n}) = x_{n+1} \mid x_{1:n}\big)$ is not significantly greater than $\frac{1}{2}$, that is,
the performance of $p$ is no better than a predictor that makes completely random predictions.
As probabilistic prediction complicates things without avoiding this fundamental problem, in
the remainder of this paper we will consider only deterministic predictors. This will also allow
us to see the roots of this problem as clearly as possible. With the preliminaries covered, we
now move on to the central problem considered in this paper: Predicting sequences of limited
Kolmogorov complexity.

4 Prediction of simple computable sequences


As no computable predictor can learn to predict every computable sequence, a weaker goal is to be
able to predict all “simple” computable sequences.

4.1 Definition. For $n \in \mathbb{N}$, let $\mathcal{C}_n := \{\omega \in \mathcal{C} : K(\omega) \le n\}$. Further, let $P_n := P(\mathcal{C}_n)$ be the set
of predictors able to learn to predict all sequences in $\mathcal{C}_n$.

Firstly we establish that prediction algorithms exist that can learn to predict all sequences up
to a given complexity, and that these predictors need not be significantly more complex than the
sequences they can predict:
4.2 Lemma. $\forall n \in \mathbb{N}, \exists p \in P_n : K(p) \overset{+}{<} n + O(\log n)$.

Proof. Let $h \in \mathbb{N}$ be the number of programs of length $n$ or less which generate infinite sequences.
Build the value of $h$ into a prediction algorithm $p$ constructed as follows:
In the $k$th prediction cycle run in parallel all programs of length $n$ or less until $h$ of these
programs have each produced $k + 1$ symbols of output. Next predict according to the $(k + 1)$th
symbol of the generated string whose first $k$ symbols are consistent with the observed string. If two
generated strings are consistent with the observed sequence (there cannot be more than two as the
strings are binary and have length $k + 1$), pick the one which was generated by the program that
occurs first in a lexicographical ordering of the programs. If no generated output is consistent,
give up and output a fixed symbol.
For sufficiently large $k$, only the $h$ programs which produce infinite sequences will produce
output strings of length $k + 1$. As this set of sequences is finite, they can be uniquely identified
by finite initial strings. Thus for sufficiently large $k$ the predictor $p$ will correctly predict any
computable sequence $\omega$ for which $K(\omega) \le n$, that is, $p \in P_n$.
As there are $2^{n+1} - 1$ possible strings of length $n$ or less, $h < 2^{n+1}$ and thus we can encode $h$
with $\log_2 h + 2 \log_2 \log_2 h \le n + 1 + 2 \log_2 (n + 1)$ bits. Thus, $K(p) < n + 1 + 2 \log_2 (n + 1) + c$ for
some constant $c$ that is independent of $n$. □
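The role played by $h$ can be illustrated with a toy model. The sketch below makes several simplifying assumptions that are not in the report: a "program" is a Python function `run(steps)` returning the output it has written within a step budget, the list `programs` is taken to be in lexicographic order already, and the names `bounded_predictor`, `zeros`, `alternating` and `stalls` are hypothetical. It returns the prediction of the first consistent program, as in the tie-break rule of the proof.

```python
from typing import Callable, List, Sequence

# Toy stand-in for a program: run(steps) returns the output written so far.
Program = Callable[[int], List[int]]


def bounded_predictor(programs: Sequence[Program], h: int,
                      x: Sequence[int], default: int = 0) -> int:
    """One prediction cycle, given that exactly h programs never stop writing.

    If h were larger than the true count, the loop below would never end;
    knowing h is what lets the predictor stop waiting for stalled programs.
    """
    k = len(x)
    steps = 1
    while True:
        outputs = [run(steps) for run in programs]
        long_enough = [out for out in outputs if len(out) >= k + 1]
        if len(long_enough) >= h:
            break
        steps *= 2                     # give every program a larger budget
    for out in long_enough:            # first consistent program wins
        if list(out[:k]) == list(x):
            return out[k]
    return default                     # nothing consistent: fixed symbol


def zeros(steps: int) -> List[int]:
    return [0] * steps                 # writes one 0 per step, forever


def alternating(steps: int) -> List[int]:
    return [i % 2 for i in range(steps // 2)]  # one symbol every two steps


def stalls(steps: int) -> List[int]:
    return [1, 1, 1][:steps]           # writes three symbols, then loops silently


if __name__ == "__main__":
    progs = [zeros, alternating, stalls]
    print(bounded_predictor(progs, h=2, x=[0, 1, 0]))
```

In early cycles a stalled program may still be among the candidates, which is why the proof only claims correctness for sufficiently large $k$.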

Can we do better than this? Lemmas 3.2 and 3.3 show us that there exist predictors able to
predict at least some sequences vastly more complex than themselves. This suggests that there
might exist simple predictors able to predict arbitrary sequences up to a high complexity. Formally,
could there exist $p \in P_n$ where $n \gg K(p)$? Unfortunately, these simple but powerful predictors
are not possible:

4.3 Theorem. $\forall n \in \mathbb{N} : p \in P_n \Rightarrow K(p) \overset{+}{>} n$.

Proof. For any $n \in \mathbb{N}$ let $p \in P_n$, that is, $\forall \omega \in \mathcal{C}_n : p \in P(\omega)$. By Lemma 3.4 we know that
$\exists \omega' \in \mathcal{C} : p \notin P(\omega')$. As $p \notin P(\omega')$ it must be the case that $\omega' \notin \mathcal{C}_n$, that is, $K(\omega') \ge n$. From
Lemma 3.4 we also know that $K(p) \overset{+}{>} K(\omega')$ and so the result follows. □

Intuitively the reason for this is as follows: Lemma 3.4 guarantees that every simple predictor
fails for at least one simple sequence. Thus if we want a predictor that can learn to predict all
sequences up to a moderate level of complexity, then clearly the predictor cannot be simple.
Likewise, if we want a predictor that can predict all sequences up to a high level of complexity, then the
predictor itself must be very complex. Thus, even though we have made the generous assumption
of unlimited computational resources and data to learn from, only very complex algorithms can
be truly powerful predictors.
These results easily generalise to notions of complexity that take computation time into
consideration. As sequences are infinite, the appropriate measure of time is the time needed to generate
or predict the next symbol in the sequence. Under any reasonable measure of time complexity,
the operation of inverting a single output from a binary valued function can be performed with
little cost. If $C$ is any complexity measure with this property, it is trivial to see that the proof of
Lemma 3.4 still holds for $C$. From this, an analogue of Theorem 4.3 for $C$ easily follows.
With similar arguments these results also generalise in a straightforward way to complexity
measures that take space or other computational resources into account. Thus, the fact that
extremely powerful predictors must be very complex, holds under any measure of complexity for
which inverting a single bit is inexpensive.

5 Complexity of prediction
Another way of viewing these results is in terms of an alternate notion of sequence complexity
defined as the size of the smallest predictor able to learn to predict the sequence. This allows us
to express the results of the previous sections more concisely. Formally, for any sequence $\omega$ define
the complexity measure,
$$\dot{K}(\omega) := \min_{p \in \mathbb{B}^*} \{ |p| : p \in P(\omega) \},$$
and $\dot{K}(\omega) := \infty$ if $P(\omega) = \emptyset$. Thus, if $\dot{K}(\omega)$ is high then the sequence $\omega$ is complex in the sense
that only complex prediction algorithms are able to learn to predict it. It can easily be seen
that this notion of complexity has the same invariance to the choice of reference universal Turing
machine as the standard Kolmogorov complexity measure.
It may be tempting to conjecture that this definition simply describes what might be called
the “tail complexity” of a sequence, that is, $\dot{K}(\omega)$ is equal to $\liminf_{i \to \infty} K(\omega_{i:\infty})$. This is not the
case. In the proof of Lemma 3.3 we saw that there exists a single predictor capable of learning
to predict any sequence that consists of a repeating string, and thus for these sequences $\dot{K}$ is
bounded. It was further shown that there exist sequences of this form with arbitrarily high tail
complexity. Clearly then tail complexity and $\dot{K}$ cannot be equal in general.
Using $\dot{K}$ we can now rewrite a number of our previous results much more succinctly. From
Lemma 3.1 it immediately follows that,
$$\forall \omega : 0 \le \dot{K}(\omega) \overset{+}{<} K(\omega).$$
From Lemma 3.2 we know that $\exists c \in \mathbb{N}, \forall n \in \mathbb{N}, \exists \omega \in \mathcal{C}$ such that $\dot{K}(\omega) < c$ and $K(\omega) > n$, that
is, $\dot{K}$ can attain the lower bound above within a small constant, no matter how large the value of
$K$ is. The sequences for which the upper bound on $\dot{K}$ is tight are interesting as they are the ones
which demand complex predictors. We prove the existence of these sequences and look at some of
their properties in the next section.
The complexity measure $\dot{K}$ can also be generalised to sets of sequences: for $S \subset \mathbb{B}^\infty$ define
$\dot{K}(S) := \min_p \{ |p| : p \in P(S) \}$. This allows us to rewrite Lemma 4.2 and Theorem 4.3 as simply,
$$\forall n \in \mathbb{N} : n \overset{+}{<} \dot{K}(\mathcal{C}_n) \overset{+}{<} n + O(\log n).$$
This is just a restatement of the fact that the simplest predictor capable of predicting all sequences
up to a Kolmogorov complexity of $n$ has itself a Kolmogorov complexity of roughly $n$.
Perhaps the most surprising thing about K̇ complexity is that this very natural definition of
the complexity of a sequence, as viewed from the perspective of prediction, does not appear to
have been studied before.

6 Hard to predict sequences


We have already seen that some individual sequences, such as the repeating string used in the
proof of Lemma 3.3, can have arbitrarily high Kolmogorov complexity but nevertheless can be
predicted by trivial algorithms. Thus, although these sequences contain a lot of information in
the Kolmogorov sense, in a deeper sense their structure is very simple and easily learnt.
What interests us in this section is the other extreme; individual sequences which can only be
predicted by complex predictors. As we are only concerned with prediction in the limit, this extra
complexity in the predictor must be some kind of special information which cannot be learnt just
through observing the sequence. Our first task is to show that these difficult to predict sequences
exist.
6.1 Theorem. $\forall n \in \mathbb{N}, \exists \omega \in \mathcal{C} : n \overset{+}{<} \dot{K}(\omega) \overset{+}{<} K(\omega) \overset{+}{<} n + O(\log n)$.

Proof. For any $n \in \mathbb{N}$, let $Q_n \subset \mathbb{B}^{<n}$ be the set of programs shorter than $n$ that are predictors,
and let $x_{1:k} \in \mathbb{B}^k$ be the observed initial string from the sequence $\omega$ which is to be predicted. Now
construct a meta-predictor $\hat{p}$:
By dovetailing the computations, run in parallel every program of length less than $n$ on every
string in $\mathbb{B}^{\le k}$. Each time a program is found to halt on all of these input strings, add the program
to a set of “candidate prediction algorithms”, called $\tilde{Q}_n^k$. As each element of $Q_n$ is a valid predictor,
and thus halts for all input strings in $\mathbb{B}^*$ by definition, for every $n$ and $k$ it eventually will be the
case that $|\tilde{Q}_n^k| = |Q_n|$. At this point the simulation to approximate $Q_n$ terminates. It is clear that
for sufficiently large values of $k$ all of the valid predictors, and only the valid predictors, will halt
with a single symbol of output on all tested input strings. That is, $\exists r \in \mathbb{N}, \forall k > r : \tilde{Q}_n^k = Q_n$.
The second part of the $\hat{p}$ algorithm uses these candidate prediction algorithms to make a
prediction. For $p \in \tilde{Q}_n^k$ define $d_k(p) := \sum_{i=1}^{k-1} |p(x_{1:i}) - x_{i+1}|$. Informally, $d_k(p)$ is the number of
prediction errors made by $p$ so far. Compute this for all $p \in \tilde{Q}_n^k$ and then let $p_k^* \in \tilde{Q}_n^k$ be the
program with minimal $d_k(p)$. If there is more than one such program, break the tie by letting $p_k^*$
be the lexicographically first of these. Finally, $\hat{p}$ computes the value of $p_k^*(x_{1:k})$ and then returns
this as its prediction and halts.
By Lemma 3.4, there exists $\omega' \in \mathcal{C}$ such that $\hat{p}$ makes a prediction error for every $k$ when trying
to predict $\omega'$. Thus, in each cycle at least one of the finitely many predictors with minimal $d_k$
makes a prediction error and so $\forall p \in Q_n : d_k(p) \to \infty$ as $k \to \infty$. Therefore, $\nexists p \in Q_n : p \in P(\omega')$,
that is, no program of length less than $n$ can learn to predict $\omega'$ and so $n \le \dot{K}(\omega')$. Further, from
Lemma 3.1 we know that $\dot{K}(\omega') \overset{+}{<} K(\omega')$, and from Lemma 3.4 again, $K(\omega') \overset{+}{<} K(\hat{p})$.
Examining the algorithm for $\hat{p}$, we see that it contains some fixed length program code and
an encoding of $|Q_n|$, where $|Q_n| < 2^n - 1$. Thus, using a standard encoding method for integers,
$K(\hat{p}) \overset{+}{<} n + O(\log n)$.
Chaining these together we get
$n \overset{+}{<} \dot{K}(\omega') \overset{+}{<} K(\omega') \overset{+}{<} K(\hat{p}) \overset{+}{<} n + O(\log n)$, which proves the
theorem. □
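The second part of the meta-predictor, which selects the candidate with the fewest errors so far, can be sketched as below. This is a toy illustration under stated assumptions: a short explicit list of Python functions stands in for the candidate set $\tilde{Q}_n^k$, list order stands in for lexicographic order, and all names are hypothetical. The dovetailed search for halting programs is not modelled.

```python
from typing import Callable, Sequence

Predictor = Callable[[Sequence[int]], int]


def errors_so_far(p: Predictor, x: Sequence[int]) -> int:
    """d_k(p): number of i in 1..k-1 with p(x_{1:i}) != x_{i+1}."""
    return sum(abs(p(x[:i]) - x[i]) for i in range(1, len(x)))


def meta_predict(candidates: Sequence[Predictor], x: Sequence[int]) -> int:
    """Predict with the candidate that has made the fewest errors on x.

    Python's min returns the first minimal element, so ties are broken
    in favour of the candidate that appears earliest in the list.
    """
    best = min(candidates, key=lambda p: errors_so_far(p, x))
    return best(x)


def always_zero(x: Sequence[int]) -> int:
    return 0


def always_one(x: Sequence[int]) -> int:
    return 1


def flip_last(x: Sequence[int]) -> int:
    return 1 - x[-1] if x else 0


if __name__ == "__main__":
    pool = [always_zero, always_one, flip_last]
    print(meta_predict(pool, [0, 1, 0, 1, 0, 1]))
```

Feeding `meta_predict` the opposite of its own prediction at every step, as in Lemma 3.4, forces the current best candidate to err each time, which is the mechanism the proof uses to drive every $d_k(p)$ to infinity.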

This establishes the existence of sequences with arbitrarily high K̇ complexity which also have
a similar level of Kolmogorov complexity. Next we establish a fundamental property of high K̇
complexity sequences: they are extremely difficult to compute.
For an algorithm $q$ that generates $\omega \in \mathcal{C}$, define $t_q(n)$ to be the number of computation steps
performed by $q$ before the $n$th symbol of $\omega$ is written to the output tape. For example, if $q$ is a
simple algorithm that outputs the sequence $010101\ldots$, then clearly $t_q(n) = O(n)$ and so $\omega$ can
be computed quickly. The following lemma proves that if a sequence can be computed in a
reasonable amount of time, then the sequence must have a low $\dot{K}$ complexity:

6.2 Lemma. $\forall \omega \in \mathcal{C}$, if $\exists q : \mathcal{U}(q) = \omega$ and $\exists r \in \mathbb{N}, \forall n > r : t_q(n) < 2^n$, then $\dot{K}(\omega) \overset{+}{=} 0$.

Proof. Construct a prediction algorithm $\tilde{p}$ as follows:
On input $x_{1:n}$, run all programs of length $n$ or less, each for $2^{n+1}$ steps. In a set $W_n$ collect
together all generated strings which are at least $n + 1$ symbols long and where the first $n$ symbols
match the observed string $x_{1:n}$. Now order the strings in $W_n$ according to a lexicographical
ordering of their generating programs. If $W_n = \emptyset$, then just return a prediction of 1 and halt. If
$|W_n| \ge 1$ then return the $(n + 1)$th symbol from the first sequence in the above ordering.
Assume that $\exists q : \mathcal{U}(q) = \omega$ such that $\exists r \in \mathbb{N}, \forall n > r : t_q(n) < 2^n$. If $q$ is not unique, take
$q$ to be the lexicographically first of these. Clearly $\forall n > r$ the initial string from $\omega$ generated
by $q$ will be in the set $W_n$. As there is no lexicographically lower program which can generate $\omega$
within the time constraint $t_q(n) < 2^n$ for all $n > r$, for sufficiently large $n$ the predictor $\tilde{p}$ must
converge on using $q$ for each prediction and thus $\tilde{p} \in P(\omega)$. As $|\tilde{p}|$ is clearly a fixed constant that
is independent of $\omega$, it follows then that $\dot{K}(\omega) \le |\tilde{p}| \overset{+}{=} 0$. □

We could replace the $2^n$ bound in the above result with any monotonically growing computable
function, for example, $2^{2^n}$. In any case, this does not change the fundamental result that sequences
which have a high $\dot{K}$ complexity are practically impossible to compute. However, from our
theoretical perspective these sequences present no problem as they can be predicted, albeit with immense
difficulty.

7 The limits of mathematical analysis


One way to interpret the results of the previous sections is in terms of constructive theories of
prediction. Essentially, a constructive theory of prediction T , expressed in some sufficiently rich
formal system F, is in effect a description of a prediction algorithm with respect to a universal
Turing machine which implements the required parts of F. Thus from Theorems 4.3 and 6.1
it follows that if we want to have a predictor that can learn to predict all sequences up to a
high level of Kolmogorov complexity, or even just predict individual sequences which have high
K̇ complexity, the constructive theory of prediction that we base our predictor on must be very
complex. Elegant and highly general constructive theories of prediction simply do not exist, even
if we assume unlimited computational resources. This is in marked contrast to Solomonoff’s highly
elegant but non-constructive theory of prediction.
Naturally, highly complex theories of prediction will be very difficult to mathematically analyse,
if not practically impossible. Thus at some point the development of very general prediction
algorithms must become mainly an experimental endeavour due to the difficulty of working with
the required theory. Interestingly, an even stronger result can be proven showing that beyond
some point the mathematical analysis is in fact impossible, even in theory:

7.1 Theorem. In any consistent formal axiomatic system F that is sufficiently rich to express
statements of the form “p ∈ Pn ”, there exists m ∈ N such that for all n > m and for all predictors
p ∈ Pn the true statement “p ∈ Pn ” cannot be proven in F.

In other words, even though we have proven that very powerful sequence prediction algorithms
exist, beyond a certain complexity it is impossible to find any of these algorithms using
mathematics. The proof has a similar structure to Chaitin’s information theoretic proof [3] of Gödel’s
incompleteness theorem for formal axiomatic systems [6].
Proof. For each n ∈ N let Tn be the set of statements expressed in the formal system F of
the form “p ∈ Pn ”, where p is filled in with the complete description of some algorithm in each
case. As the set of programs is denumerable, Tn is also denumerable and each element of Tn has
finite length. From Lemma 4.2 and Theorem 4.3 it follows that each Tn contains infinitely many
statements of the form “p ∈ Pn ” which are true.
Fix n and create a search algorithm s that enumerates all proofs in the formal system F
searching for a proof of a statement in the set Tn . As the set Tn is recursive, s can always
recognise a proof of a statement in Tn . If s finds any such proof, it outputs the corresponding
program p and then halts.
By way of contradiction, assume that s halts, that is, a proof of a theorem in Tn is found and
p such that p ∈ Pn is generated as output. The size of the algorithm s is a constant (a description
of the formal system F and some proof enumeration code) plus an O(log n) term needed
to describe n. It follows then that $K(p) \overset{+}{<} O(\log n)$. However from Theorem 4.3 we know that
$K(p) \overset{+}{>} n$. Thus, for sufficiently large n, we have a contradiction and so our assumption of the
existence of a proof must be false. That is, for sufficiently large n and for all p ∈ Pn , the true
statement “p ∈ Pn ” cannot be proven within the formal system F. □
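
The counting step of the proof can be written out as a single chain. The symbol $c_F$ is a label introduced here for the constant part of s (the description of F plus the enumeration code); it is notation assumed for this sketch and does not appear in the original.

```latex
\begin{align*}
K(p) &\le |s| + O(1) \le c_F + O(\log n)
      && \text{($p$ is the output of $s$, and $s$ is described by $F$ and $n$)} \\
K(p) &\overset{+}{>} n
      && \text{(Theorem 4.3, since $p \in P_n$)} \\
\Longrightarrow \quad n &\overset{+}{<} c_F + O(\log n)
\end{align*}
```

Since log n grows more slowly than n, the last line fails for all sufficiently large n, which gives the threshold m of the theorem.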

The exact value of m depends on our choice of formal system F and on the reference machine
U with respect to which we measure complexity. However for reasonable choices of F and U the value of
m would be on the order of 1000. That is, the bound m is certainly not so large as to be vacuous.

8 Discussion
Solomonoff induction is an elegant and extremely general model of inductive learning. It neatly
brings together the philosophical principles of Occam’s razor, Epicurus’ principle of multiple explanations,
Bayes’ theorem and Turing’s model of universal computation into a theoretical sequence
predictor with astonishingly powerful properties. If theoretical models of prediction can have such
elegance and power, one cannot help but wonder whether similarly beautiful and highly general
computable theories of prediction are also possible.
What we have shown here is that there does not exist an elegant constructive theory of predic-
tion for computable sequences, even if we assume unbounded computational resources, unbounded
data and learning time, and place moderate bounds on the Kolmogorov complexity of the sequences
to be predicted. Very powerful computable predictors are therefore necessarily complex. We have
further shown that the source of this problem is computable sequences which are extremely expen-
sive to compute. While we have proven that very powerful prediction algorithms which can learn
to predict these sequences exist, we have also proven that, unfortunately, mathematical analysis
cannot be used to discover these algorithms due to problems of Gödel incompleteness.
These results can be extended to more general settings, specifically to those problems which
are equivalent to, or depend on, sequence prediction. Consider, for example, a reinforcement
learning agent interacting with an environment [15, 8]. In each interaction cycle the agent must
choose its actions so as to maximise the future rewards that it receives from the environment.

Figure 1: Theorem 4.3 rules out simple but powerful artificial intelligence algorithms, as indicated
by the greyed out region on the lower right. Theorem 7.1 upper bounds how powerful an algorithm
can be before it can no longer be proven to be a powerful algorithm. This is indicated by the
vertical line separating the region of provable algorithms from the region of Gödel incompleteness.
Note: This diagram is a correction of the published version.

Of course the agent cannot know for certain whether or not some action will lead to rewards
in the future, thus it must predict these. Clearly, at the heart of reinforcement learning lies a
prediction problem, and so the results for computable predictors presented in this paper also
apply to computable reinforcement learners. More specifically, from Theorem 4.3 it follows that
very powerful computable reinforcement learners are necessarily complex, and from Theorem 7.1
it follows that it is impossible to discover extremely powerful reinforcement learning algorithms
mathematically. These relationships are illustrated in Figure 1.
It is reasonable to ask whether the assumptions we have made in our model need to be changed.
If we increase the power of the predictors further, for example by providing them with some kind
of an oracle, this would make the predictors even more unrealistic than they currently are. Clearly
this goes against our goal of finding an elegant, powerful and general prediction theory that is
more realistic in its assumptions than Solomonoff’s incomputable model. On the other hand, if
we weaken our assumptions about the predictors’ resources to make them more realistic, we are
in effect taking a subset of our current class of predictors. As such, all the same limitations and
problems will still apply, as well as some new ones.
It seems then that the way forward is to further restrict the problem space. One possibility
would be to bound the amount of computation time needed to generate the next symbol in the
sequence. However if we do this without restricting the predictors’ resources then the simple
predictor from Lemma 6.2 easily learns to predict any such sequence and thus the problem of
prediction in the limit has become trivial. Another possibility might be to bound the memory
of the machine used to generate the sequence, however this makes the generator a finite state
machine and thus bounds its computation time, again making the problem trivial.
Perhaps the only reasonable solution would be to add additional restrictions to both the algo-
rithms which generate the sequences to be predicted, and to the predictors. We may also want to
consider not just learnability in the limit, but also how quickly the predictor is able to learn. Of
course we are then facing a much more difficult analysis problem.

Acknowledgements
I would like to thank Marcus Hutter, Alexey Chernov, Daniil Ryabko and Laurent Orseau for
useful discussions and advice during the development of this paper.

References
[1] J. M. Barzdin. Prognostication of automata and functions. Information Processing, 71:81–84,
1972.
[2] C. S. Calude. Information and Randomness. Springer, Berlin, 2nd edition, 2002.
[3] G. J. Chaitin. Gödel’s theorem and information. International Journal of Theoretical Physics,
22:941–954, 1982.
[4] A. P. Dawid. Comment on The impossibility of inductive inference. Journal of the American
Statistical Association, 80(390):340–341, 1985.
[5] M. Feder, N. Merhav, and M. Gutman. Universal prediction of individual sequences. IEEE
Trans. on Information Theory, 38:1258–1270, 1992.
[6] K. Gödel. Über formal unentscheidbare Sätze der Principia Mathematica und verwandter
Systeme I. Monatshefte für Mathematik und Physik, 38:173–198, 1931. [English translation by
E. Mendelsohn: “On undecidable propositions of formal mathematical systems”. In M. Davis,
editor, The undecidable, pages 39–71, New York, 1965. Raven Press, Hewlett].
[7] E. Mark Gold. Language identification in the limit. Information and Control, 10(5):447–474,
1967.
[8] M. Hutter. Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Prob-
ability. Springer, Berlin, 2005. 300 pages, [Link] marcus/ai/[Link].
[9] M. Hutter. On the foundations of universal sequence prediction. In Proc. 3rd Annual Con-
ference on Theory and Applications of Models of Computation (TAMC’06), volume 3959 of
LNCS, pages 408–420. Springer, 2006.

[10] M. Li and P. M. B. Vitányi. An introduction to Kolmogorov complexity and its applications.
Springer, 2nd edition, 1997.

[11] J. Poland and M. Hutter. Convergence of discrete MDL for sequential prediction. In Proc.
17th Annual Conf. on Learning Theory (COLT’04), volume 3120 of LNAI, pages 300–314,
Banff, 2004. Springer, Berlin.

[12] J. J. Rissanen. Fisher Information and Stochastic Complexity. IEEE Trans. on Information
Theory, 42(1):40–47, January 1996.

[13] R. J. Solomonoff. A formal theory of inductive inference: Part 1 and 2. Inform. Control,
7:1–22, 224–254, 1964.

[14] R. J. Solomonoff. Complexity-based induction systems: comparisons and convergence theorems.
IEEE Trans. Information Theory, IT-24:422–432, 1978.

[15] R. Sutton and A. Barto. Reinforcement learning: An introduction. Cambridge, MA, MIT
Press, 1998.

[16] V. V. V’yugin. Non-stochastic infinite and finite sequences. Theoretical computer science,
207:363–382, 1998.

[17] C. S. Wallace and D. M. Boulton. An information measure for classification. Computer Jrnl.,
11(2):185–194, August 1968.

[18] F.M.J. Willems, Y.M. Shtarkov, and Tj.J. Tjalkens. The context-tree weighting method:
Basic properties. IEEE Transactions on Information Theory, 41(3), 1995.
