Is there an Elegant Universal

Theory of Prediction?

Shane Legg

Technical Report No. IDSIA-12-06

October 19, 2006

IDSIA / USI-SUPSI
Dalle Molle Institute for Artificial Intelligence
Galleria 2, 6928 Manno, Switzerland

IDSIA is a joint institute of both University of Lugano (USI) and University of Applied Sciences of Southern Switzerland
(SUPSI), and was founded in 1988 by the Dalle Molle Foundation which promoted quality of life.
Is there an Elegant Universal

Theory of Prediction? ∗
Shane Legg†
October 19, 2006

Abstract
Solomonoff’s inductive learning model is a powerful, universal and highly elegant theory
of sequence prediction. Its critical flaw is that it is incomputable and thus cannot be used in
practice. It is sometimes suggested that it may still be useful to help guide the development
of very general and powerful theories of prediction which are computable. In this paper it
is shown that although powerful algorithms exist, they are necessarily highly complex. This
alone makes their theoretical analysis problematic; however, it is further shown that beyond a
moderate level of complexity the analysis runs into the deeper problem of Gödel incomplete-
ness. This limits the power of mathematics to analyse and study prediction algorithms, and
indeed intelligent systems in general.

1 Introduction
Solomonoff’s model of induction rapidly learns to make optimal predictions for any computable
sequence, including probabilistic ones [13, 14]. It neatly brings together the philosophical principles
of Occam’s razor, Epicurus’ principle of multiple explanations, Bayes’ theorem and Turing’s
model of universal computation into a theoretical sequence predictor with astonishingly powerful
properties. Indeed the problem of sequence prediction could well be considered solved [9, 8], if it
were not for the fact that Solomonoff’s theoretical model is incomputable.
Among computable theories there exist powerful general predictors, such as the Lempel-Ziv
algorithm [5] and Context Tree Weighting [18], that can learn to predict some complex sequences,
but not others. Some prediction methods, based on the Minimum Description Length principle [12]
or the Minimum Message Length principle [17], can even be viewed as computable approximations
of Solomonoff induction [10]. However in practice their power and generality are limited by
the power of the compression methods employed, as well as having a significantly reduced data
efficiency as compared to Solomonoff induction [11].
Could there exist elegant computable prediction algorithms that are in some sense universal?
Unfortunately this is impossible, as pointed out by Dawid [4]. Specifically, he notes that for any
statistical forecasting system there exist sequences which are not calibrated. Dawid also notes that
a forecasting system for a family of distributions is necessarily more complex than any forecasting
∗ This work was supported by SNF grant 200020-107616.
† shane@[Link]
system generated from a single distribution in the family. However, he does not deal with the
complexity of the sequences themselves, nor does he make a precise statement in terms of a
specific measure of complexity, such as Kolmogorov complexity. The impossibility of forecasting
has since been developed in considerably more depth by V’yugin [16]; in particular, he proves that
there is an efficient randomised procedure producing sequences that cannot be predicted (with
high probability) by computable forecasting systems.
In this paper we study the prediction of computable sequences from the perspective of Kol-
mogorov complexity. The central question we look at is the prediction of sequences which have
bounded Kolmogorov complexity. This leads us to a new notion of complexity: rather than the
length of the shortest program able to generate a given sequence, in other words Kolmogorov com-
plexity, we take the length of the shortest program able to learn to predict the sequence. This new
complexity measure has the same fundamental invariance property as Kolmogorov complexity,
and a number of strong relationships between the two measures are proven. However in general
the two may diverge significantly. For example, although a long random string that indefinitely
repeats has a very high Kolmogorov complexity, this sequence also has a relatively simple structure
that even a simple predictor can learn to predict.
We then prove that some sequences, however, can only be predicted by very complex predictors.
This implies that very general prediction algorithms, in particular those that can learn to predict
all sequences up to a given Kolmogorov complexity, must themselves be complex. This puts an end
to our hope of there being an extremely general and yet relatively simple prediction algorithm. We
then use this fact to prove that although very powerful prediction algorithms exist, they cannot
be mathematically discovered due to Gödel incompleteness. Given how fundamental prediction is
to intelligence, this result implies that beyond a moderate level of complexity the development of
powerful artificial intelligence algorithms can only be an experimental science.

2 Preliminaries
An alphabet $A$ is a finite set of 2 or more elements which are called symbols. In this paper we
will assume a binary alphabet $B := \{0, 1\}$, though all the results can easily be generalised to
other alphabets. A string is a finite ordered $n$-tuple of symbols denoted $x := x_1 x_2 \ldots x_n$ where
$\forall i \in \{1, \ldots, n\}, x_i \in B$, or more succinctly, $x \in B^n$. The 0-tuple is denoted $\lambda$ and is called the
null string. The expression $B^{\leq n}$ has the obvious interpretation, and $B^* := \bigcup_{n \in \mathbb{N}} B^n$. The length-lexicographical
ordering is a total order on $B^*$ defined as
$\lambda < 0 < 1 < 00 < 01 < 10 < 11 < 000 < 001 < \cdots$.
A substring of $x$ is defined as $x_{j:k} := x_j x_{j+1} \ldots x_k$ where $1 \leq j \leq k \leq n$. By $|x|$ we mean
the length of the string $x$, for example, $|x_{j:k}| = k - j + 1$. We will sometimes need to encode a
natural number as a string. Using simple encoding techniques it can be shown that there exists
a computable injective function $f : \mathbb{N} \to B^*$ where no string in the range of $f$ is a prefix of any
other, and $\forall n \in \mathbb{N} : |f(n)| \leq \log_2 n + 2 \log_2 \log_2 n + 1 = O(\log n)$.
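One encoding of this kind is the Elias delta code. The report does not say which encoding it has in mind, so the choice here is an assumption made purely for illustration, and the function name `elias_delta` is hypothetical. A minimal sketch for natural numbers $n \geq 1$:

```python
def elias_delta(n: int) -> str:
    """Encode a natural number n >= 1 as a prefix-free binary string."""
    if n < 1:
        raise ValueError("this sketch only encodes n >= 1")
    n_bits = n.bit_length()          # number of binary digits of n
    len_bits = n_bits.bit_length()   # number of binary digits of n_bits
    # (len_bits - 1) zeros announce the width of the length field,
    # then comes the length field itself, then n without its leading 1 bit.
    return "0" * (len_bits - 1) + format(n_bits, "b") + format(n, "b")[1:]


# Code words for a small range, used to check the prefix-free property.
codes = [elias_delta(n) for n in range(1, 200)]
```

The code word for $n$ has length $\lfloor \log_2 n \rfloor + 2 \lfloor \log_2 (\lfloor \log_2 n \rfloor + 1) \rfloor + 1$, which has the same shape as the bound quoted above.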
Unlike strings which always have finite length, a sequence $\omega$ is an infinite list of symbols
$x_1 x_2 x_3 \ldots \in B^\infty$. Of particular interest to us will be the class of sequences which can be generated
by an algorithm executed on a universal Turing machine:

2.1 Definition. A monotone universal Turing machine U is defined as a universal Turing

machine with one unidirectional input tape, one unidirectional output tape, and some bidirectional
work tapes. Input tapes are read only, output tapes are write only, unidirectional tapes are those
where the head can only move from left to right. All tapes are binary (no blank symbol) and the
work tapes are initially filled with zeros. We say that U outputs/computes a sequence ω on input
p, and write U(p) = ω, if U reads all of p but no more as it continues to write ω to the output
tape.

We fix U and define U(p, x) by simply using a standard coding technique to encode a program
p along with a string x ∈ B∗ as a single input string for U.

2.2 Definition. A sequence $\omega \in B^\infty$ is a computable binary sequence if there exists a
program $q \in B^*$ that writes $\omega$ to a one-way output tape when run on a monotone universal Turing
machine $U$, that is, $\exists q \in B^* : U(q) = \omega$. We denote the set of all computable sequences by $C$.

A similar definition for strings is not necessary as all strings have finite length and are therefore
trivially computable.

2.3 Definition. A computable binary predictor is a program p ∈ B∗ that on a universal

Turing machine U computes a total function B∗ → B.

For simplicity of notation we will often write $p(x)$ to mean the function computed by the
program $p$ when executed on $U$ along with the input string $x$, that is, $p(x)$ is shorthand for
$U(p, x)$. Having $x_{1:n}$ as input, the objective of a predictor is for its output, called its prediction,
to match the next symbol in the sequence. Formally we express this by writing $p(x_{1:n}) = x_{n+1}$.
As the algorithmic prediction of incomputable sequences, such as the halting sequence, is
impossible by definition, we only consider the problem of predicting computable sequences. To
simplify things we will assume that the predictor has an unlimited supply of computation time
and storage. We will also make the assumption that the predictor has unlimited data to learn
from, that is, we are only concerned with whether or not a predictor can learn to predict in the
following sense:

2.4 Definition. We say that a predictor $p$ can learn to predict a sequence $\omega := x_1 x_2 \ldots \in B^\infty$
if there exists $m \in \mathbb{N}$ such that $\forall n \geq m : p(x_{1:n}) = x_{n+1}$.

The existence of m in the above definition need not be constructive, that is, we might not
know when the predictor will stop making prediction errors for a given sequence, just that this
will occur eventually. This is essentially “next value” prediction as characterised by Barzdin [1],
which follows from Gold’s notion of identifiability in the limit for languages [7].
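To make the definition concrete, the following sketch records where a predictor errs on a finite initial string. It is only an illustration: "learning to predict" is a property of the whole infinite sequence, so a finite check can suggest but never establish it. The names `Predictor`, `error_positions` and `alternating` are hypothetical, and strings over `'0'`/`'1'` stand in for binary strings.

```python
from typing import Callable, List

# A predictor maps an observed string over {0, 1} to a predicted next symbol.
Predictor = Callable[[str], str]


def error_positions(p: Predictor, prefix: str) -> List[int]:
    """Return every n with 0 <= n < len(prefix) where p(x_{1:n}) != x_{n+1}."""
    return [n for n in range(len(prefix)) if p(prefix[:n]) != prefix[n]]


def alternating(x: str) -> str:
    """Toy predictor: predict the opposite of the last symbol ('0' on empty input)."""
    if not x:
        return "0"
    return "1" if x[-1] == "0" else "0"


prefix = "110" + "10" * 10            # an arbitrary start followed by 1010...
errors = error_positions(alternating, prefix)
```

On this prefix the toy predictor errs only at the first two positions, which matches the idea that errors stop after some finite $m$.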

2.5 Definition. Let $P(\omega)$ be the set of all predictors able to learn to predict $\omega$. Similarly for
sets of sequences $S \subset B^\infty$, define $P(S) := \bigcap_{\omega \in S} P(\omega)$.

A standard measure of complexity for sequences is the length of the shortest program which
generates the sequence:
2.6 Definition. For any sequence $\omega \in B^\infty$ the monotone Kolmogorov complexity of the
sequence is,
$$K(\omega) := \min_{q \in B^*} \{ |q| : U(q) = \omega \},$$
where $U$ is a monotone universal Turing machine. If no such $q$ exists, we define $K(\omega) := \infty$.

It can be shown that this measure of complexity depends on our choice of universal Turing
machine U, but only up to an additive constant that is independent of ω. This is due to the fact
that a universal Turing machine can simulate any other universal Turing machine with a fixed
length program.
In essentially the same way as the definition above we can define the Kolmogorov complexity
of a string $x \in B^n$, written $K(x)$, by requiring that $U(q)$ halts after generating $x$ on the output
tape. For an extensive treatment of Kolmogorov complexity and some of its applications see [10]
or [2].
As many of our results will have the above property of holding within an additive constant
that is independent of the variables in the expression, we will indicate this by placing a small plus
above the equality or inequality symbol. For example, $f(x) \overset{+}{<} g(x)$ means that
$\exists c \in \mathbb{R}, \forall x : f(x) < g(x) + c$. When using standard “Big O” notation this is unnecessary as expressions are
already understood to hold within an independent constant; however, for consistency of notation
we will use it in these cases also.

3 Prediction of computable sequences

The most elementary result is that every computable sequence can be predicted by at least one
predictor, and that this predictor need not be significantly more complex than the sequence to be
predicted.
3.1 Lemma. $\forall \omega \in C, \exists p \in P(\omega) : K(p) \overset{+}{<} K(\omega)$.

Proof. As the sequence $\omega$ is computable, there must exist at least one algorithm that generates
$\omega$. Let $q$ be the shortest such algorithm and construct an algorithm $p$ that “predicts” $\omega$ as follows:
Firstly the algorithm $p$ reads $x_{1:n}$ to find the value of $n$, then it runs $q$ to generate $x_{1:n+1}$ and
returns $x_{n+1}$ as its prediction. Clearly $p$ perfectly predicts $\omega$ and $|p| < |q| + c$, for some small
constant $c$ that is independent of $\omega$ and $q$. □
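The construction in this proof can be sketched in a few lines. Here a Python generator function plays the role of the program $q$; the Thue-Morse sequence is an arbitrary stand-in chosen for the example, and the names `predictor_from_generator` and `thue_morse` are hypothetical.

```python
from itertools import islice
from typing import Callable, Iterator


def predictor_from_generator(make_sequence: Callable[[], Iterator[str]]) -> Callable[[str], str]:
    """Wrap a sequence generator into a predictor, as in the proof sketch."""
    def predict(x: str) -> str:
        n = len(x)                                      # only the length of x is used
        return next(islice(make_sequence(), n, None))   # 0-based index n is x_{n+1}
    return predict


def thue_morse() -> Iterator[str]:
    """Toy generator standing in for q: yields 0, 1, 1, 0, 1, 0, 0, 1, ..."""
    i = 0
    while True:
        yield str(bin(i).count("1") % 2)   # parity of the number of 1 bits in i
        i += 1


p = predictor_from_generator(thue_morse)
seq = "".join(islice(thue_morse(), 50))
```

The wrapper adds only a fixed amount of code around the generator, which is the informal content of $|p| < |q| + c$.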

Not only can any computable sequence be predicted, there also exist very simple predictors
able to predict arbitrarily complex sequences:

3.2 Lemma. There exists a predictor $p$ such that $\forall n \in \mathbb{N}, \exists \omega \in C : p \in P(\omega)$ and $K(\omega) > n$.

Proof. Take a string $x$ such that $K(x) = |x| \geq 2n$, and from this define a sequence $\omega := x0000\ldots$.
Clearly $K(\omega) > n$ and yet a simple predictor $p$ that always predicts 0 can learn to predict $\omega$. □

The predictor used in the above proof is very simple and can only “learn” sequences that end
with all 0’s, albeit where the initial string can have arbitrarily high Kolmogorov complexity. It
may seem that this is due to sequences that are initially complex but where the “tail complexity”,
defined as $\liminf_{i \to \infty} K(\omega_{i:\infty})$, is zero. This is not the case:

3.3 Lemma. There exists a predictor $p$ such that $\forall n \in \mathbb{N}, \exists \omega \in C : p \in P(\omega)$ and
$\liminf_{i \to \infty} K(\omega_{i:\infty}) > n$.

Proof. A predictor $p$ for eventually periodic sequences can be defined as follows: On input
$\omega_{1:k}$ the predictor goes through the ordered pairs $(1, 1), (1, 2), (2, 1), (1, 3), (2, 2), (3, 1), (1, 4), \ldots$
checking for each pair $(a, b)$ whether the string $\omega_{1:k}$ consists of an initial string of length $a$ followed
by a repeating string of length $b$. On the first match that is found $p$ predicts that the repeating
string continues, and then $p$ halts. If $a + b > k$ before a match is found, then $p$ outputs a fixed
symbol and halts. Clearly $K(p)$ is a small constant and $p$ will learn to predict any sequence that
is eventually periodic.
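A direct, unoptimised rendering of this predictor is sketched below. The pair ordering follows the text; reading "consists of an initial string of length $a$ followed by a repeating string of length $b$" as "every symbol after the first $a$ equals the symbol $b$ places later" is an interpretation, and the name `periodic_predictor` is hypothetical.

```python
def periodic_predictor(x: str, default: str = "0") -> str:
    """Predict the next symbol assuming x is a prefix of an eventually periodic sequence."""
    k = len(x)
    for total in range(2, k + 1):        # total = a + b, so only pairs with a + b <= k
        for a in range(1, total):        # yields (1,1), (1,2), (2,1), (1,3), (2,2), ...
            b = total - a
            # After the first a symbols, x must agree with itself shifted by b.
            if all(x[i] == x[i + b] for i in range(a, k - b)):
                return x[k - b]          # continue the repetition: copy one period back
    return default                       # no admissible pair: output a fixed symbol


# An arbitrary initial string followed by a repeating block.
seq = "1101" + "011" * 30
```

On this example the predictor makes no further errors once enough of the sequence has been seen to rule out the shorter candidate pairs.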
For any $(m, n) \in \mathbb{N}^2$, let $\omega := x(y^*)$ where $x \in B^m$, and $y \in B^n$ is a random string, that
is, $K(y) = n$. As $\omega$ is eventually periodic $p \in P(\omega)$ and also we see that
$\liminf_{i \to \infty} K(\omega_{i:\infty}) = \min\{K(\omega_{m+1:\infty}), K(\omega_{m+2:\infty}), \ldots, K(\omega_{m+n:\infty})\}$.
For any $k \in \{1, \ldots, n\}$ let $q_k^*$ be the shortest program that can generate $\omega_{m+k:\infty}$. We can define
a halting program $q_k'$ that outputs $y$ where this program consists of $q_k^*$, $n$ and $k$. Thus,
$|q_k'| = |q_k^*| + O(\log n) = K(\omega_{m+k:\infty}) + O(\log n)$. As $n = K(y) \leq |q_k'|$, we see that $K(\omega_{m+k:\infty}) > n - O(\log n)$.
As $n$ and $k$ are arbitrary the result follows. □

Using a more sophisticated version of this proof it can be shown that there exist predictors
that can learn to predict arbitrary regular or primitive recursive sequences. Thus we might wonder
whether there exists a computable predictor able to learn to predict all computable sequences.
Unfortunately, no universal predictor exists; indeed, for every predictor there exists a sequence
which it cannot predict at all:
3.4 Lemma. For any predictor $p$ there constructively exists a sequence $\omega := x_1 x_2 \ldots \in C$ such
that $\forall n \in \mathbb{N} : p(x_{1:n}) \neq x_{n+1}$ and $K(\omega) \overset{+}{<} K(p)$.
Proof. For any computable predictor $p$ there constructively exists a computable sequence
$\omega = x_1 x_2 x_3 \ldots$ computed by an algorithm $q$ defined as follows: Set $x_1 = 1 - p(\lambda)$, then $x_2 = 1 - p(x_1)$,
then $x_3 = 1 - p(x_{1:2})$ and so on. Clearly $\omega \in C$ and $\forall n \in \mathbb{N} : p(x_{1:n}) = 1 - x_{n+1}$.
Let $p^*$ be the shortest program that computes the same function as $p$ and define a sequence
generation algorithm $q^*$ based on $p^*$ using the procedure above. By construction, $|q^*| = |p^*| + c$
for some constant $c$ that is independent of $p^*$. Because $q^*$ generates $\omega$, it follows that $K(\omega) \leq |q^*|$.
By definition $K(p) = |p^*|$ and so $K(\omega) \overset{+}{<} K(p)$. □
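The diagonal construction used in this proof is short enough to write out. The sketch below builds a finite initial string of the adversarial sequence for any total predictor given as a Python function; `adversarial_prefix` and the two toy predictors are hypothetical names introduced for the example.

```python
from typing import Callable


def adversarial_prefix(p: Callable[[str], str], length: int) -> str:
    """Build x_1 ... x_length with x_{n+1} = 1 - p(x_{1:n}) at every step."""
    x = ""
    for _ in range(length):
        x += "1" if p(x) == "0" else "0"   # always emit the symbol p did not predict
    return x


def always_zero(x: str) -> str:
    """Toy predictor that ignores its input."""
    return "0"


def majority(x: str) -> str:
    """Toy predictor: predict the symbol seen most often so far (ties give '0')."""
    return "1" if x.count("1") > x.count("0") else "0"


hard_for_zero = adversarial_prefix(always_zero, 10)
hard_for_majority = adversarial_prefix(majority, 40)
```

The generating loop is a fixed wrapper around the predictor, which mirrors the step $|q^*| = |p^*| + c$ in the proof.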

Allowing the predictor to be probabilistic does not fundamentally avoid the problem of
Lemma 3.4. In each step, rather than generating the opposite to what will be predicted by $p$,
instead $q$ attempts to generate the symbol which $p$ is least likely to predict given $x_{1:n}$. To do this
$q$ must simulate $p$ in order to estimate $\Pr\big(p(x_{1:n}) = 1 \mid x_{1:n}\big)$. With sufficient simulation effort, $q$
can estimate this probability to any desired accuracy for any $x_{1:n}$. This produces a computable
sequence $\omega$ such that $\forall n \in \mathbb{N} : \Pr\big(p(x_{1:n}) = x_{n+1} \mid x_{1:n}\big)$ is not significantly greater than $\frac{1}{2}$, that is,
the performance of $p$ is no better than a predictor that makes completely random predictions.
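A simplified sketch of this probabilistic variant follows. It assumes the probability that the predictor outputs 1 is available exactly as a function, whereas the text has $q$ estimate it by simulation; that simplification, the Laplace-style toy predictor, and all function names are assumptions made for the example.

```python
from typing import Callable, List, Tuple


def laplace_prob_one(x: str) -> float:
    """Toy probabilistic predictor: probability of predicting '1' after seeing x."""
    return (x.count("1") + 1) / (len(x) + 2)


def least_likely_sequence(prob_one: Callable[[str], float], length: int) -> Tuple[str, List[float]]:
    """Emit, at each step, the symbol the predictor is least likely to predict."""
    x = ""
    success = []                                   # chance of a correct prediction per step
    for _ in range(length):
        q1 = prob_one(x)
        nxt = "0" if q1 >= 0.5 else "1"            # pick the less probable prediction
        success.append(q1 if nxt == "1" else 1.0 - q1)
        x += nxt
    return x, success


omega_prefix, success = least_likely_sequence(laplace_prob_one, 30)
```

By construction each entry of `success` is at most one half, so on this sequence the predictor does no better than a fair coin.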
As probabilistic prediction complicates things without avoiding this fundamental problem, in
the remainder of this paper we will consider only deterministic predictors. This will also allow
us to see the roots of this problem as clearly as possible. With the preliminaries covered, we
now move on to the central problem considered in this paper: Predicting sequences of limited
Kolmogorov complexity.

4 Prediction of simple computable sequences

As no computable predictor can learn to predict every computable sequence, a weaker goal is to be
able to predict all “simple” computable sequences.
4.1 Definition. For $n \in \mathbb{N}$, let $C_n := \{\omega \in C : K(\omega) \leq n\}$. Further, let $P_n := P(C_n)$ be the set
of predictors able to learn to predict all sequences in $C_n$.

Firstly we establish that prediction algorithms exist that can learn to predict all sequences up
to a given complexity, and that these predictors need not be significantly more complex than the
sequences they can predict:
4.2 Lemma. $\forall n \in \mathbb{N}, \exists p \in P_n : K(p) \overset{+}{<} n + O(\log n)$.

Proof. Let $h \in \mathbb{N}$ be the number of programs of length $n$ or less which generate infinite sequences.
Build the value of $h$ into a prediction algorithm $p$ constructed as follows:
In the $k$th prediction cycle run in parallel all programs of length $n$ or less until $h$ of these
programs have each produced $k + 1$ symbols of output. Next predict according to the $(k + 1)$th
symbol of the generated string whose first $k$ symbols are consistent with the observed string. If two
generated strings are consistent with the observed sequence (there cannot be more than two as the
strings are binary and have length $k + 1$), pick the one which was generated by the program that
occurs first in a lexicographical ordering of the programs. If no generated output is consistent,
give up and output a fixed symbol.
For sufficiently large $k$, only the $h$ programs which produce infinite sequences will produce
output strings of length $k + 1$. As this set of sequences is finite, they can be uniquely identified
by finite initial strings. Thus for sufficiently large $k$ the predictor $p$ will correctly predict any
computable sequence $\omega$ for which $K(\omega) \leq n$, that is, $p \in P_n$.
As there are $2^{n+1} - 1$ possible strings of length $n$ or less, $h < 2^{n+1}$ and thus we can encode $h$
with $\log_2 h + 2 \log_2 \log_2 h \leq n + 1 + 2 \log_2 (n + 1)$ bits. Thus, $K(p) < n + 1 + 2 \log_2 (n + 1) + c$ for
some constant $c$ that is independent of $n$. □
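The prediction cycle can be illustrated on a toy hypothesis class. In the sketch below a short list of Python generators stands in for "all programs of length $n$ or less"; every toy generator produces an infinite sequence, so the count $h$ and the handling of programs that stop producing output are not modelled. List order plays the role of the lexicographical ordering. All names are hypothetical.

```python
from itertools import islice
from typing import Callable, Iterator, List


def zeros() -> Iterator[str]:
    while True:
        yield "0"


def ones() -> Iterator[str]:
    while True:
        yield "1"


def alternating() -> Iterator[str]:
    i = 0
    while True:
        yield str(i % 2)          # 0, 1, 0, 1, ...
        i += 1


def thue_morse() -> Iterator[str]:
    i = 0
    while True:
        yield str(bin(i).count("1") % 2)
        i += 1


def enumeration_predictor(generators: List[Callable[[], Iterator[str]]]) -> Callable[[str], str]:
    """Predict with the first generator whose output agrees with the observed string."""
    def predict(x: str) -> str:
        k = len(x)
        for make in generators:
            out = "".join(islice(make(), k + 1))   # first k + 1 symbols of this candidate
            if out[:k] == x:
                return out[k]                      # its (k + 1)th symbol is the prediction
        return "0"                                 # nothing consistent: fixed symbol
    return predict


toy_class = [zeros, ones, alternating, thue_morse]
predict = enumeration_predictor(toy_class)
```

Because the class is finite, each member is separated from the others by a finite initial string, after which the predictor no longer errs on it.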

Can we do better than this? Lemmas 3.2 and 3.3 show us that there exist predictors able to
predict at least some sequences vastly more complex than themselves. This suggests that there
might exist simple predictors able to predict arbitrary sequences up to a high complexity. Formally,
could there exist p ∈ Pn where n ≫ K(p)? Unfortunately, these simple but powerful predictors
are not possible:
4.3 Theorem. $\forall n \in \mathbb{N} : p \in P_n \Rightarrow K(p) \overset{+}{>} n$.

Proof. For any $n \in \mathbb{N}$ let $p \in P_n$, that is, $\forall \omega \in C_n : p \in P(\omega)$. By Lemma 3.4 we know that
$\exists \omega' \in C : p \notin P(\omega')$. As $p \notin P(\omega')$ it must be the case that $\omega' \notin C_n$, that is, $K(\omega') \geq n$. From
Lemma 3.4 we also know that $K(p) \overset{+}{>} K(\omega')$ and so the result follows. □

Intuitively the reason for this is as follows: Lemma 3.4 guarantees that every simple predictor
fails for at least one simple sequence. Thus if we want a predictor that can learn to predict all
sequences up to a moderate level of complexity, then clearly the predictor cannot be simple. Like-
wise, if we want a predictor that can predict all sequences up to a high level of complexity, then the
predictor itself must be very complex. Thus, even though we have made the generous assumption
of unlimited computational resources and data to learn from, only very complex algorithms can
be truly powerful predictors.
These results easily generalise to notions of complexity that take computation time into consid-
eration. As sequences are infinite, the appropriate measure of time is the time needed to generate
or predict the next symbol in the sequence. Under any reasonable measure of time complexity,
the operation of inverting a single output from a binary valued function can be performed with
little cost. If C is any complexity measure with this property, it is trivial to see that the proof of
Lemma 3.4 still holds for C. From this, an analogue of Theorem 4.3 for C easily follows.
With similar arguments these results also generalise in a straightforward way to complexity
measures that take space or other computational resources into account. Thus, the fact that
extremely powerful predictors must be very complex, holds under any measure of complexity for
which inverting a single bit is inexpensive.

5 Complexity of prediction
Another way of viewing these results is in terms of an alternate notion of sequence complexity
defined as the size of the smallest predictor able to learn to predict the sequence. This allows us
to express the results of the previous sections more concisely. Formally, for any sequence ω define
the complexity measure,
$$\dot{K}(\omega) := \min_{p \in B^*} \{ |p| : p \in P(\omega) \},$$

and K̇(ω) := ∞ if P (ω) = ∅. Thus, if K̇(ω) is high then the sequence ω is complex in the sense
that only complex prediction algorithms are able to learn to predict it. It can easily be seen
that this notion of complexity has the same invariance to the choice of reference universal Turing
machine as the standard Kolmogorov complexity measure.
It may be tempting to conjecture that this definition simply describes what might be called
the “tail complexity” of a sequence, that is, $\dot{K}(\omega)$ is equal to $\liminf_{i \to \infty} K(\omega_{i:\infty})$. This is not the
case. In the proof of Lemma 3.3 we saw that there exists a single predictor capable of learning
to predict any sequence that consists of a repeating string, and thus for these sequences K̇ is
bounded. It was further shown that there exist sequences of this form with arbitrarily high tail
complexity. Clearly then tail complexity and K̇ cannot be equal in general.
Using K̇ we can now rewrite a number of our previous results much more succinctly. From
Lemma 3.1 it immediately follows that,
$$\forall \omega : 0 \leq \dot{K}(\omega) \overset{+}{<} K(\omega).$$
From Lemma 3.2 we know that ∃c ∈ N, ∀n ∈ N, ∃ ω ∈ C such that K̇(ω) < c and K(ω) > n, that
is, K̇ can attain the lower bound above within a small constant, no matter how large the value of
K is. The sequences for which the upper bound on K̇ is tight are interesting as they are the ones
which demand complex predictors. We prove the existence of these sequences and look at some of
their properties in the next section.
The complexity measure $\dot{K}$ can also be generalised to sets of sequences, for $S \subset B^\infty$ define
$\dot{K}(S) := \min_p \{ |p| : p \in P(S) \}$. This allows us to rewrite Lemma 4.2 and Theorem 4.3 as simply,
$$\forall n \in \mathbb{N} : n \overset{+}{<} \dot{K}(C_n) \overset{+}{<} n + O(\log n).$$
This is just a restatement of the fact that the simplest predictor capable of predicting all sequences
up to a Kolmogorov complexity of n, has itself a Kolmogorov complexity of roughly n.
Perhaps the most surprising thing about K̇ complexity is that this very natural definition of
the complexity of a sequence, as viewed from the perspective of prediction, does not appear to
have been studied before.
6 Hard to predict sequences

We have already seen that some individual sequences, such as the repeating string used in the
proof of Lemma 3.3, can have arbitrarily high Kolmogorov complexity but nevertheless can be
predicted by trivial algorithms. Thus, although these sequences contain a lot of information in
the Kolmogorov sense, in a deeper sense their structure is very simple and easily learnt.
What interests us in this section is the other extreme; individual sequences which can only be
predicted by complex predictors. As we are only concerned with prediction in the limit, this extra
complexity in the predictor must be some kind of special information which cannot be learnt just
through observing the sequence. Our first task is to show that these difficult to predict sequences
exist.
6.1 Theorem. $\forall n \in \mathbb{N}, \exists \omega \in C : n \overset{+}{<} \dot{K}(\omega) \overset{+}{<} K(\omega) \overset{+}{<} n + O(\log n)$.

Proof. For any $n \in \mathbb{N}$, let $Q_n \subset B^{<n}$ be the set of programs shorter than $n$ that are predictors,
and let $x_{1:k} \in B^k$ be the observed initial string from the sequence $\omega$ which is to be predicted. Now
construct a meta-predictor $\hat{p}$:
By dovetailing the computations, run in parallel every program of length less than $n$ on every
string in $B^{\leq k}$. Each time a program is found to halt on all of these input strings, add the program
to a set of “candidate prediction algorithms”, called $\tilde{Q}_n^k$. As each element of $Q_n$ is a valid predictor,
and thus halts for all input strings in $B^*$ by definition, for every $n$ and $k$ it eventually will be the
case that $|\tilde{Q}_n^k| = |Q_n|$. At this point the simulation to approximate $Q_n$ terminates. It is clear that
for sufficiently large values of $k$ all of the valid predictors, and only the valid predictors, will halt
with a single symbol of output on all tested input strings. That is, $\exists r \in \mathbb{N}, \forall k > r : \tilde{Q}_n^k = Q_n$.
The second part of the $\hat{p}$ algorithm uses these candidate prediction algorithms to make a
prediction. For $p \in \tilde{Q}_n^k$ define $d_k(p) := \sum_{i=1}^{k-1} |p(x_{1:i}) - x_{i+1}|$. Informally, $d_k(p)$ is the number of
prediction errors made by $p$ so far. Compute this for all $p \in \tilde{Q}_n^k$ and then let $p_k^* \in \tilde{Q}_n^k$ be the
program with minimal $d_k(p)$. If there is more than one such program, break the tie by letting $p_k^*$
be the lexicographically first of these. Finally, $\hat{p}$ computes the value of $p_k^*(x_{1:k})$ and then returns
this as its prediction and halts.
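The selection step of the meta-predictor can be sketched as follows. A short list of total Python functions stands in for the candidate set, so the dovetailed search that builds that set is not modelled; list order plays the role of the lexicographical tie-break. The names `meta_predict`, `always_zero`, `always_one` and `flip_last` are hypothetical.

```python
from typing import Callable, List

Predictor = Callable[[str], str]


def meta_predict(candidates: List[Predictor], x: str) -> str:
    """Predict with the candidate that has made the fewest errors on x so far."""
    def errors(p: Predictor) -> int:
        # Number of i in 1..k-1 with p(x_{1:i}) != x_{i+1}, written 0-based.
        return sum(p(x[:i]) != x[i] for i in range(1, len(x)))
    best = min(candidates, key=errors)   # min keeps the first of several minimal entries
    return best(x)


def always_zero(x: str) -> str:
    return "0"


def always_one(x: str) -> str:
    return "1"


def flip_last(x: str) -> str:
    """Predict the opposite of the last symbol ('0' on empty input)."""
    if not x:
        return "0"
    return "1" if x[-1] == "0" else "0"


candidates = [always_zero, always_one, flip_last]
```

For example, on the observed string `"10101010"` the third candidate has made no errors, so the meta-predictor follows it and predicts `"1"`.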
By Lemma 3.4, there exists $\omega' \in C$ such that $\hat{p}$ makes a prediction error for every $k$ when trying
to predict $\omega'$. Thus, in each cycle at least one of the finitely many predictors with minimal $d_k$
makes a prediction error and so $\forall p \in Q_n : d_k(p) \to \infty$ as $k \to \infty$. Therefore, $\nexists p \in Q_n : p \in P(\omega')$,
that is, no program of length less than $n$ can learn to predict $\omega'$ and so $n \leq \dot{K}(\omega')$. Further, from
Lemma 3.1 we know that $\dot{K}(\omega') \overset{+}{<} K(\omega')$, and from Lemma 3.4 again, $K(\omega') \overset{+}{<} K(\hat{p})$.
Examining the algorithm for $\hat{p}$, we see that it contains some fixed length program code and
an encoding of $|Q_n|$, where $|Q_n| < 2^n - 1$. Thus, using a standard encoding method for integers,
$K(\hat{p}) \overset{+}{<} n + O(\log n)$.
Chaining these together we get,
$n \overset{+}{<} \dot{K}(\omega') \overset{+}{<} K(\omega') \overset{+}{<} K(\hat{p}) \overset{+}{<} n + O(\log n)$, which proves the
theorem. □

This establishes the existence of sequences with arbitrarily high K̇ complexity which also have
a similar level of Kolmogorov complexity. Next we establish a fundamental property of high K̇
complexity sequences: they are extremely difficult to compute.
For an algorithm $q$ that generates $\omega \in C$, define $t_q(n)$ to be the number of computation steps
performed by $q$ before the $n$th symbol of $\omega$ is written to the output tape. For example, if $q$ is a
simple algorithm that outputs the sequence $010101\ldots$, then clearly $t_q(n) = O(n)$ and so $\omega$ can
be computed quickly. The following lemma proves that if a sequence can be computed in a
reasonable amount of time, then the sequence must have a low $\dot{K}$ complexity:
6.2 Lemma. $\forall \omega \in C$, if $\exists q : U(q) = \omega$ and $\exists r \in \mathbb{N}, \forall n > r : t_q(n) < 2^n$, then $\dot{K}(\omega) \overset{+}{=} 0$.

Proof. Construct a prediction algorithm $\tilde{p}$ as follows:

On input $x_{1:n}$, run all programs of length $n$ or less, each for $2^{n+1}$ steps. In a set $W_n$ collect
together all generated strings which are at least $n + 1$ symbols long and where the first $n$ symbols
match the observed string $x_{1:n}$. Now order the strings in $W_n$ according to a lexicographical
ordering of their generating programs. If $W_n = \emptyset$, then just return a prediction of 1 and halt. If
$|W_n| \geq 1$ then return the $(n + 1)$th symbol from the first sequence in the above ordering.
Assume that $\exists q : U(q) = \omega$ such that $\exists r \in \mathbb{N}, \forall n > r : t_q(n) < 2^n$. If $q$ is not unique, take
$q$ to be the lexicographically first of these. Clearly $\forall n > r$ the initial string from $\omega$ generated
by $q$ will be in the set $W_n$. As there is no lexicographically lower program which can generate $\omega$
within the time constraint $t_q(n) < 2^n$ for all $n > r$, for sufficiently large $n$ the predictor $\tilde{p}$ must
converge on using $q$ for each prediction and thus $\tilde{p} \in P(\omega)$. As $|\tilde{p}|$ is clearly a fixed constant that
is independent of $\omega$, it follows then that $\dot{K}(\omega) \overset{+}{<} |\tilde{p}| \overset{+}{=} 0$. □

We could replace the $2^n$ bound in the above result with any monotonically growing computable
function, for example, $2^{2^n}$. In any case, this does not change the fundamental result that sequences
which have a high K̇ complexity are practically impossible to compute. However from our theoret-
ical perspective these sequences present no problem as they can be predicted, albeit with immense
difficulty.

7 The limits of mathematical analysis

One way to interpret the results of the previous sections is in terms of constructive theories of
prediction. Essentially, a constructive theory of prediction T , expressed in some sufficiently rich
formal system F, is in effect a description of a prediction algorithm with respect to a universal
Turing machine which implements the required parts of F. Thus from Theorems 4.3 and 6.1
it follows that if we want to have a predictor that can learn to predict all sequences up to a
high level of Kolmogorov complexity, or even just predict individual sequences which have high
K̇ complexity, the constructive theory of prediction that we base our predictor on must be very
complex. Elegant and highly general constructive theories of prediction simply do not exist, even
if we assume unlimited computational resources. This is in marked contrast to Solomonoff’s highly
elegant but non-constructive theory of prediction.
Naturally, highly complex theories of prediction will be very difficult to mathematically analyse,
if not practically impossible. Thus at some point the development of very general prediction
algorithms must become mainly an experimental endeavour due to the difficulty of working with
the required theory. Interestingly, an even stronger result can be proven showing that beyond
some point the mathematical analysis is in fact impossible, even in theory:

7.1 Theorem. In any consistent formal axiomatic system $F$ that is sufficiently rich to express
statements of the form “$p \in P_n$”, there exists $m \in \mathbb{N}$ such that for all $n > m$ and for all predictors
$p \in P_n$ the true statement “$p \in P_n$” cannot be proven in $F$.
In other words, even though we have proven that very powerful sequence prediction algorithms
exist, beyond a certain complexity it is impossible to find any of these algorithms using mathematics.
The proof has a similar structure to Chaitin’s information theoretic proof [3] of Gödel’s
incompleteness theorem for formal axiomatic systems [6].
Proof. For each n ∈ N let Tn be the set of statements expressed in the formal system F of
the form “p ∈ Pn ”, where p is filled in with the complete description of some algorithm in each
case. As the set of programs is denumerable, Tn is also denumerable and each element of Tn has
finite length. From Lemma 4.2 and Theorem 4.3 it follows that each Tn contains infinitely many
statements of the form “p ∈ Pn ” which are true.
Fix n and create a search algorithm s that enumerates all proofs in the formal system F
searching for a proof of a statement in the set Tn . As the set Tn is recursive, s can always
recognise a proof of a statement in Tn . If s finds any such proof, it outputs the corresponding
program p and then halts.
By way of contradiction, assume that s halts, that is, a proof of a theorem in Tn is found and
p such that p ∈ Pn is generated as output. The size of the algorithm s is a constant (a description
of the formal system F and some proof enumeration code) as well as an O(log n) term needed
to describe n. It follows then that $K(p) \overset{+}{<} O(\log n)$. However from Theorem 4.3 we
know that $K(p) \overset{+}{>} n$. Thus, for sufficiently large n, we have a contradiction and so our
assumption of the existence of a proof must be false. That is, for sufficiently large n and for all
p ∈ Pn , the true statement “p ∈ Pn ” cannot be proven within the formal system F. □
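
The search procedure s used in the proof can be sketched in Python. This is only an illustration under stated assumptions: the names all_strings, toy_check_proof and search are hypothetical, and the toy checker is an arbitrary stand-in for a real proof checker of F, which the paper does not specify. What the sketch is meant to show is that s consists of fixed enumeration code plus the single parameter n, which is why its description length is a constant plus O(log n).

```python
from itertools import count, product
from typing import Callable, Iterator, Optional, Tuple

# A checked proof yields the pair (p, n) of the proven statement "p in P_n".
Proved = Optional[Tuple[str, int]]


def all_strings(alphabet: str = "01") -> Iterator[str]:
    """Enumerate every finite string over the alphabet, shortest first."""
    for length in count(0):
        for chars in product(alphabet, repeat=length):
            yield "".join(chars)


def toy_check_proof(candidate: str) -> Proved:
    """Hypothetical stand-in for the proof checker of F (NOT a real system).

    A candidate counts as a valid "proof" when it reads 1^k 0 p with
    len(p) >= k; it is then taken to prove the statement "p in P_k".
    """
    k = candidate.find("0")
    if k == -1:
        return None
    p = candidate[k + 1:]
    return (p, k) if len(p) >= k else None


def search(n: int, check_proof: Callable[[str], Proved],
           max_candidates: Optional[int] = None) -> Optional[str]:
    """Return the program p of the first proof of "p in P_n" found.

    Without max_candidates the loop runs forever when no such proof
    exists, like the procedure s in the proof.
    """
    for i, candidate in enumerate(all_strings()):
        if max_candidates is not None and i >= max_candidates:
            return None
        proved = check_proof(candidate)
        if proved is not None and proved[1] == n:
            return proved[0]
    return None
```

With the toy checker, search(2, toy_check_proof) returns "00", the program named by the first valid candidate 11000. In the setting of Theorem 7.1 the corresponding call never returns once n exceeds m, because a returned p would have a description much shorter than Theorem 4.3 allows.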

The exact value of m depends on our choice of formal system F and on the reference machine
U with respect to which we measure complexity. However, for reasonable choices of F and U the
value of m would be on the order of 1000. That is, the bound m is certainly not so large as to be
vacuous.
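
To see how a concrete m arises, write the two bounds from the proof with explicit constants: K(p) < c log2(n) + d1 and K(p) > n - d2. Both can hold only while n - c log2(n) < d, where d = d1 + d2. The following sketch finds the largest such n. The function name threshold_m and the values of c and d are hypothetical and chosen purely for illustration; the paper does not give these constants.

```python
import math


def threshold_m(c: float, d: float) -> int:
    """Largest n >= 1 with n - c*log2(n) < d, or 0 if there is none.

    For every n above the returned value the two bounds on K(p)
    contradict each other. c and d are assumed, illustrative constants.
    """
    def g(n: int) -> float:
        return n - c * math.log2(n)

    # g is non-decreasing for n >= c / ln(2), so from this point on we can
    # walk upward until the inequality first fails.
    turning = max(1, math.ceil(c / math.log(2)))
    n = turning
    if g(n) < d:
        while g(n + 1) < d:
            n += 1
        return n
    # No solution at or beyond the turning point; check smaller n.
    for k in range(turning - 1, 0, -1):
        if g(k) < d:
            return k
    return 0
```

For example, threshold_m(2.0, 1000.0) gives a value slightly above 1000, since the logarithmic term adds only about 20 to the constant. This matches the observation that m is dominated by the size of the description of F and of the proof enumeration code.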

8 Discussion
Solomonoff induction is an elegant and extremely general model of inductive learning. It neatly
brings together the philosophical principles of Occam’s razor, Epicurus’ principle of multiple expla-
nations, Bayes theorem and Turing’s model of universal computation into a theoretical sequence
predictor with astonishingly powerful properties. If theoretical models of prediction can have such
elegance and power, one cannot help but wonder whether similarly beautiful and highly general
computable theories of prediction are also possible.
What we have shown here is that there does not exist an elegant constructive theory of predic-
tion for computable sequences, even if we assume unbounded computational resources, unbounded
data and learning time, and place moderate bounds on the Kolmogorov complexity of the sequences
to be predicted. Very powerful computable predictors are therefore necessarily complex. We have
further shown that the source of this problem is computable sequences which are extremely expen-
sive to compute. While we have proven that very powerful prediction algorithms which can learn
to predict these sequences exist, we have also proven that, unfortunately, mathematical analysis
cannot be used to discover these algorithms due to problems of Gödel incompleteness.
These results can be extended to more general settings, specifically to those problems which
are equivalent to, or depend on, sequence prediction. Consider, for example, a reinforcement
learning agent interacting with an environment [15, 8]. In each interaction cycle the agent must
choose its actions so as to maximise the future rewards that it receives from the environment.

Figure 1: Theorem 4.3 rules out simple but powerful artificial intelligence algorithms, as indicated
by the greyed out region on the lower right. Theorem 7.1 upper bounds how powerful an algorithm
can be before it can no longer be proven to be a powerful algorithm. This is indicated by the
vertical line separating the region of provable algorithms from the region of Gödel incompleteness.
Note: This diagram is a correction of the published version.

Of course the agent cannot know for certain whether or not some action will lead to rewards
in the future, thus it must predict these. Clearly, at the heart of reinforcement learning lies a
prediction problem, and so the results for computable predictors presented in this paper also
apply to computable reinforcement learners. More specifically, from Theorem 4.3 it follows that
very powerful computable reinforcement learners are necessarily complex, and from Theorem 7.1
it follows that it is impossible to discover extremely powerful reinforcement learning algorithms
mathematically. These relationships are illustrated in Figure 1.
It is reasonable to ask whether the assumptions we have made in our model need to be changed.
If we increase the power of the predictors further, for example by providing them with some kind
of an oracle, this would make the predictors even more unrealistic than they currently are. Clearly
this goes against our goal of finding an elegant, powerful and general prediction theory that is
more realistic in its assumptions than Solomonoff’s incomputable model. On the other hand, if
we weaken our assumptions about the predictors’ resources to make them more realistic, we are
in effect taking a subset of our current class of predictors. As such, all the same limitations and
problems will still apply, as well as some new ones.
It seems then that the way forward is to further restrict the problem space. One possibility
would be to bound the amount of computation time needed to generate the next symbol in the
sequence. However if we do this without restricting the predictors’ resources then the simple
predictor from Lemma 6.2 easily learns to predict any such sequence and thus the problem of
prediction in the limit has become trivial. Another possibility might be to bound the memory
of the machine used to generate the sequence, however this makes the generator a finite state
machine and thus bounds its computation time, again making the problem trivial.
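
The effect of bounding the generator's memory can be illustrated with a small simulation. The sketch below makes simplifying assumptions that are not in the paper: the generator is a deterministic finite state machine without input that emits exactly one symbol per step, and the names run_fsm and preperiod_and_period are hypothetical. With finitely many states some state must repeat, after which the output cycles.

```python
from typing import Dict, List, Tuple


def run_fsm(step: Dict[int, int], emit: Dict[int, int],
            start: int, length: int) -> List[int]:
    """Emit `length` symbols from an input-free finite state machine."""
    state, out = start, []
    for _ in range(length):
        out.append(emit[state])
        state = step[state]
    return out


def preperiod_and_period(step: Dict[int, int], start: int) -> Tuple[int, int]:
    """Return (steps before the cycle begins, length of the cycle)."""
    first_seen: Dict[int, int] = {}
    state, t = start, 0
    while state not in first_seen:
        first_seen[state] = t
        state = step[state]
        t += 1
    return first_seen[state], t - first_seen[state]


# Hypothetical four-state machine: state 0 is visited once, then 1 -> 2 -> 3 -> 1.
step = {0: 1, 1: 2, 2: 3, 3: 1}
emit = {0: 0, 1: 1, 2: 0, 3: 1}
pre, per = preperiod_and_period(step, 0)
symbols = run_fsm(step, emit, 0, 20)
```

Here the output repeats with period per from position pre onwards, and pre + per never exceeds the number of states. A sequence of this kind is generated with a bounded amount of work per symbol, which is the sense in which the memory bound reduces to the time-bounded case above.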
Perhaps the only reasonable solution would be to add additional restrictions to both the algo-
rithms which generate the sequences to be predicted, and to the predictors. We may also want to
consider not just learnability in the limit, but also how quickly the predictor is able to learn. Of
course we are then facing a much more difficult analysis problem.

Acknowledgements
I would like to thank Marcus Hutter, Alexey Chernov, Daniil Ryabko and Laurent Orseau for
useful discussions and advice during the development of this paper.

References
[1] J. M. Barzdin. Prognostication of automata and functions. Information Processing, 71:81–84,
1972.
[2] C. S. Calude. Information and Randomness. Springer, Berlin, 2nd edition, 2002.
[3] G. J. Chaitin. Gödel’s theorem and information. International Journal of Theoretical Physics,
22:941–954, 1982.
[4] A. P. Dawid. Comment on The impossibility of inductive inference. Journal of the American
Statistical Association, 80(390):340–341, 1985.
[5] M. Feder, N. Merhav, and M. Gutman. Universal prediction of individual sequences. IEEE
Trans. on Information Theory, 38:1258–1270, 1992.
[6] K. Gödel. Über formal unentscheidbare Sätze der Principia Mathematica und verwandter
Systeme I. Monatshefte für Mathematik und Physik, 38:173–198, 1931. [English translation by
E. Mendelsohn: “On undecidable propositions of formal mathematical systems”. In M. Davis,
editor, The undecidable, pages 39–71, New York, 1965. Raven Press, Hewlett].
[7] E. Mark Gold. Language identification in the limit. Information and Control, 10(5):447–474,
1967.
[8] M. Hutter. Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Prob-
ability. Springer, Berlin, 2005. 300 pages, [Link] marcus/ai/[Link].
[9] M. Hutter. On the foundations of universal sequence prediction. In Proc. 3rd Annual Con-
ference on Theory and Applications of Models of Computation (TAMC’06), volume 3959 of
LNCS, pages 408–420. Springer, 2006.

[10] M. Li and P. M. B. Vitányi. An introduction to Kolmogorov complexity and its applications.
Springer, 2nd edition, 1997.
[11] J. Poland and M. Hutter. Convergence of discrete MDL for sequential prediction. In Proc.
17th Annual Conf. on Learning Theory (COLT’04), volume 3120 of LNAI, pages 300–314,
Banff, 2004. Springer, Berlin.
[12] J. J. Rissanen. Fisher Information and Stochastic Complexity. IEEE Trans. on Information
Theory, 42(1):40–47, January 1996.
[13] R. J. Solomonoff. A formal theory of inductive inference: Part 1 and 2. Inform. Control,
7:1–22, 224–254, 1964.
[14] R. J. Solomonoff. Complexity-based induction systems: comparisons and convergence
theorems. IEEE Trans. Information Theory, IT-24:422–432, 1978.
[15] R. Sutton and A. Barto. Reinforcement learning: An introduction. Cambridge, MA, MIT
Press, 1998.
[16] V. V. V’yugin. Non-stochastic infinite and finite sequences. Theoretical computer science,
207:363–382, 1998.
[17] C. S. Wallace and D. M. Boulton. An information measure for classification. Computer Jrnl.,
11(2):185–194, August 1968.
[18] F.M.J. Willems, Y.M. Shtarkov, and Tj.J. Tjalkens. The context-tree weighting method:
Basic properties. IEEE Transactions on Information Theory, 41(3), 1995.
