Effective Complexity
Murray Gell-Mann
Seth Lloyd
length. Such strings are sometimes called “random” strings, although the termi-
nology does not agree precisely with the usual meaning of random (stochastic,
especially with equal probabilities for all alternatives). Some authors call AIC
“algorithmic complexity,” but it is not properly a measure of complexity, since
randomness is not what we usually mean when we speak of complexity. Another
name for AIC, “algorithmic randomness,” is somewhat more apt.
Now we can begin to construct a technical definition of effective complexity,
using AIC (or something very like it) as a minimum description length. We split
the AIC of the string representing the entity into two terms, one for regularities
and the other for features treated as random or incidental. The first term is then
the effective complexity (EC), the minimum description length of the regularities of
the entity [8].
It is not enough to define EC as the AIC of the regularities of an entity.
We must still examine how the regularities are described and distinguished from
features treated as random, using the judgment of what is important. One of
the best ways to exhibit regularities is the method used in statistical mechanics,
say, for a classical sample of a pure gas. The detailed description of the positions
and momenta of all the molecules is obviously too much information to gather,
store, retrieve, or interpret. Instead, certain regularities are picked out. The
entity considered—the real sample of gas—is embedded conceptually in a set
of comparable samples, where the others are all imagined rather than real. The
members of the set are assigned probabilities, so that we have an ensemble. The
entity itself must be a typical member of the ensemble (in other words, not one
with abnormally low probability). The set and its probability distribution will
then reflect the regularities.
For extensive systems, the statistical-mechanical methods of Boltzmann and
Gibbs, when described in modern language, amount to using the principle of max-
imum ignorance, as emphasized by Jaynes [9]. The ignorance measure or Shannon
information I is introduced. (With a multiplicative constant, I is the entropy.)
Then the probabilities in the ensemble are varied and I is maximized subject to
keeping fixed certain average quantities over the ensemble. For example, if the
average energy is kept fixed—and nothing else—the Maxwell-Boltzmann distri-
bution of probabilities results.
We have, of course,
I = − Σ_r P_r log P_r ,                                              (1)

where log means logarithm to the base 2 and the P_r 's are the (coarse-grained)
probabilities for the individual members r of the ensemble. The multiplicative
constant that yields entropy is k ln 2, where k is Boltzmann’s constant.
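To make the prescription concrete, the following short numerical sketch (added here for illustration; the four energy levels, the target average energy, and the use of Python with NumPy and SciPy are arbitrary choices, not part of the original treatment) maximizes I over probability distributions on a small set of energy levels while keeping the average energy fixed. The maximizing distribution comes out in the Maxwell-Boltzmann (exponential) form, and the final lines convert I to thermodynamic entropy through the factor k ln 2.

    import numpy as np
    from scipy.optimize import minimize

    # Toy setting (illustrative only): four energy levels and a fixed
    # ensemble-average energy.
    E = np.array([0.0, 1.0, 2.0, 3.0])        # energy levels, arbitrary units
    E_avg = 1.2                               # the fixed average energy

    def neg_I(P):
        # Negative of the ignorance of eq. (1): I = -sum_r P_r log2 P_r
        P = np.clip(P, 1e-12, 1.0)
        return float(np.sum(P * np.log2(P)))

    constraints = [
        {"type": "eq", "fun": lambda P: np.sum(P) - 1.0},       # normalization
        {"type": "eq", "fun": lambda P: np.dot(P, E) - E_avg},  # fixed <E>
    ]
    P0 = np.full(len(E), 1.0 / len(E))
    res = minimize(neg_I, P0, bounds=[(1e-12, 1.0)] * len(E),
                   constraints=constraints, method="SLSQP")
    P = res.x

    # Maximizing I subject to these constraints gives the Maxwell-Boltzmann
    # form P_r proportional to exp(-beta E_r), so ln P_r is linear in E_r.
    beta = -np.polyfit(E, np.log(P), 1)[0]
    I_bits = -neg_I(P)
    k = 1.380649e-23                          # Boltzmann's constant, J/K
    S = k * np.log(2) * I_bits                # entropy = k ln 2 times I

    print("maxent probabilities:", np.round(P, 4))
    print("fitted beta         :", round(beta, 4))
    print("I (bits)            :", round(I_bits, 4))
    print("entropy S (J/K)     :", S)

With only the normalization constraint, the same calculation returns the uniform distribution, for which I is simply log2 of the number of levels.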
In this situation, with one real member of the ensemble and the rest imag-
ined, the fine-grained probabilities are all zero for the members of the ensemble
other than e, the entity under consideration (or the bit string describing it).
Of course, the fine-grained probability of e is unity. The typicality condition
where ≈ means equal to within a few bits (here actually one bit).
Let us define the total information Σ as the sum of Y and I. The first term
is, of course, the AIC of the ensemble and we have seen that the second is, to
within a bit, the average contingent AIC of the members given the ensemble.
To throw some light on the role of the total information, consider the situa-
tion of a theoretical scientist trying to construct a theory to account for a large
body of data. Suppose the theory can be represented as a probability distribu-
tion over a set of bodies of data, one of which consists of the real data and the
rest of which are imagined. Then Y corresponds to the complexity of the theory
and I measures the extent to which the predictions of the theory are distributed
widely over different possible bodies of data. Ideally, the theorist would like both
quantities to be small, the first so as to make the theory simple and the second so
as to make it focus narrowly on the real data. However, there may be trade-offs.
By adding bells and whistles to the theory, along with a number of arbitrary
parameters, one may be able to focus on the real data, but at the expense of
complicating the theory. Similarly, by allowing appreciable probabilities for very
many possible bodies of data, one may be able to get away with a simple the-
ory. (Occasionally, of course, a theorist is fortunate enough to be able to make
both Y and I small, as James Clerk Maxwell did in the case of the equations
for electromagnetism.) In any case, the first desideratum is to minimize the sum
of the two terms, the total information Σ. Then one can deal with the possible
trade-offs.
We shall show that to within a few bits the smallest possible value of Σ is
K ≡ K(e), the AIC of the string representing the entity itself. Here we make use
of the typicality condition (2): minus the log of the (coarse-grained) probability for
the entity is less than or equal to I to within a few bits. We also make use of
certain abstract properties of the AIC:
K(A) ≲ K(A, B)                                                       (4)

and

K(A, B) ≲ K(B) + K(A|B) ,                                            (5)

where again the symbol ≲ means "less than or equal to" up to a few bits. A
true information measure would, of course, obey the first relation without the
caveat "up to a few bits" and would obey the second relation as an equality.
Because of efficient recoding, we have

K(e|E) ≲ − log P_e .                                                 (6)
We can now prove that K = K(e) is an approximate lower bound for the total
information Σ = K(E) + I:

K ≲ K(e, E) ,                                                        (7a)
K(e, E) ≲ Y + K(e|E) ,                                               (7b)
K(e|E) ≲ − log P_e ,                                                 (7c)
− log P_e ≲ I ,                                                      (7d)

where (7a) follows from eq. (4), (7b) from eq. (5) with Y = K(E), (7c) from
eq. (6), and (7d) from the typicality condition (2). Chaining these four
relations gives, to within a few bits, K ≲ Y + I = Σ.
We see, too, that when the approximate lower bound is achieved, all these
approximate inequalities become approximate equalities:

K ≈ K(e, E) ,                                                        (8a)
K(e, E) ≈ Y + K(e|E) ,                                               (8b)
K(e|E) ≈ − log P_e ,                                                 (8c)
− log P_e ≈ I .                                                      (8d)
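The algorithmic quantities in these relations are not computable, but the purely probabilistic content of (8d), namely that I is the ensemble average of − log P_r and that a member actually drawn from the ensemble has − log P_e close to that average, is easy to check numerically. The sketch below (added for illustration; the product ensemble over bit strings and its parameters are arbitrary choices) does exactly that.

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative ensemble: bit strings of length n, each bit equal to 1
    # independently with probability p (arbitrary choices).
    n, p = 200, 0.2
    I = n * (-p * np.log2(p) - (1 - p) * np.log2(1 - p))   # ignorance in bits

    def neg_log2_prob(s):
        # -log2 of the (coarse-grained) probability of the string s
        ones = int(s.sum())
        return -(ones * np.log2(p) + (n - ones) * np.log2(1 - p))

    members = rng.random((2000, n)) < p        # members drawn from the ensemble
    vals = np.array([neg_log2_prob(s) for s in members])

    # The mean of -log2 P over the ensemble reproduces I; individual typical
    # members scatter around it with a spread small compared with I itself.
    print("I                 :", round(float(I), 1), "bits")
    print("mean of -log2 P_e :", round(float(vals.mean()), 1), "bits")
    print("spread (std)      :", round(float(vals.std()), 1), "bits")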
The treatment of this in Gell-Mann and Lloyd [8] is slightly flawed. The
approximate inequality (7b), although given correctly, was accidentally replaced
just the logarithm of the number of members of the subset. Also, being a typical
member of the ensemble simply means belonging to the subset.
Vitányi and Li describe how, for this model problem, Kolmogorov suggested
maximizing I subject only to staying on the straight line. In that case, as pointed
out above, one is led immediately to the point in the I − Y plane where the
boundary departs from the straight line. Kolmogorov called the value of Y at
that point the “minimum sufficient statistic.” His student L. A. Levin (now a
professor at Boston University) kept pointing out to him that this “statistic” was
always small and therefore of limited utility, but the great man paid insufficient
attention [10].
In the model problem, the boundary curve comes near the I axis at the point
where I achieves its maximum, the string length l. At that point the subset is
the entire set of strings of the same length as the one describing the entity e.
Clearly, that set has a very short description and thus a very small value of Y.
What should be done, whether in this model problem or in the more general
case that we discussed earlier, is to utilize the lowest point on the straight line
such that the average quantities judged to be important still have their fixed
values. Then Y no longer has to be tiny and the measure of ignorance I can be
much less than it was for the case of no further constraints.
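In the model problem, where I is just the logarithm of the number of members of the subset, this trade-off is easy to tabulate. The toy calculation below (added for illustration; the string length and the choice of the number of 1s as the important quantity are arbitrary) computes I for three nested subsets containing a given bit string e. Y, the AIC of the subset description, is not computable and is only indicated qualitatively in the comments.

    import math

    # Illustrative entity: a bit string e of length l containing k ones.
    l, k = 64, 16

    # Subset 1: all strings of length l.  I = log2(2**l) = l bits; the subset
    # has a very short description, so Y is tiny.
    I_full = float(l)

    # Subset 2: strings of length l with exactly k ones.  I = log2 C(l, k);
    # the description is still short, so Y stays small, but I has dropped
    # because the average quantity judged important (the number of 1s) is
    # now held fixed.
    I_constrained = math.log2(math.comb(l, k))

    # Subset 3: the singleton {e}.  I = 0, and Y is essentially K(e) itself.
    I_singleton = 0.0

    print("I, no constraint      :", I_full, "bits")
    print("I, number of 1s fixed :", round(I_constrained, 1), "bits")
    print("I, singleton {e}      :", I_singleton, "bits")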
We have succeeded, then, in splitting K into two terms, the effective com-
plexity and the measure of random information content, and they are equal to
the values of Y and I, respectively, for the chosen ensemble. We can think of the
separation of K into Y and I in terms of a distinction between a basic program
(for printing out the string representing our entity) and data fed into that basic
program.
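A toy two-part code makes this split concrete. In the sketch below (added for illustration; the repeated motif, the handful of flipped bits, and the bit-counting conventions are all arbitrary choices, and the counts are only crude stand-ins for AIC), the "basic program" part records the regularity, a short motif and the total length, while the "data" part records the incidental flipped positions; their lengths play the roles of Y and I, respectively.

    import math
    import random

    random.seed(1)

    # Toy entity: a 512-bit string built by repeating an 8-bit motif, with a
    # few incidental bits flipped at random positions.
    motif = [1, 0, 1, 1, 0, 0, 1, 0]
    n = 512
    s = (motif * (n // len(motif)))[:n]
    flips = sorted(random.sample(range(n), 6))    # the incidental features
    for i in flips:
        s[i] ^= 1

    # Two-part description:
    #   "program" part (role of Y): the motif plus the total length;
    #   "data" part (role of I): which positions were flipped.
    Y_proxy = len(motif) + math.ceil(math.log2(n))
    I_proxy = len(flips) * math.ceil(math.log2(n))

    print("raw string length :", n, "bits")
    print("program part (~Y) :", Y_proxy, "bits")
    print("data part (~I)    :", I_proxy, "bits")
    print("two-part total    :", Y_proxy + I_proxy, "bits")

Adding more incidental flips enlarges the data part while leaving the program part unchanged, which is the sense in which the random information content can grow without the effective complexity growing.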
We can also treat as a kind of coarse graining the passage from the original
singlet distribution (in which the bit string representing the entity is the only
member with nonzero probability) to an ensemble of which that bit string is a
typical member. In fact, we have been labeling the probabilities in each ensemble
as coarse-grained probabilities Pr . Now it often happens that one ensemble can
be regarded as a coarse graining of another, as was discussed in Gell-Mann and
Lloyd [8]. We can explore that situation here as it applies to ensembles that lie
on or very close to the straight line Y + I = K.
We start from the approximate equalities (8a)–(8d) (accurate to within a
few bits) that characterize an ensemble on or near the straight line. There the
coarse-graining acts on initial “singleton” probabilities that are just one for the
original string and zero for all others. We want to generalize the above formulae
to the case of an ensemble with any initial fine-grained probability distribution
p ≡ {pr }, which gets coarse grained to yield another ensemble with probability
distribution P ≡ {Pr } and approximately the same value of Σ. We propose the
of eq. (10). The elimination of the I term produces the connection of Kmut with
the effective complexity candidate.
At last we arrive at the questions relevant to a nontraditional measure of
ignorance. Suppose that for some reason we are dealing, in the definition of I,
not with the usual measure given in eq. (1), but rather with the generalization
discussed in this volume, namely
I_q = − [Σ_r (P_r)^q − 1] / (q − 1) ,                                (11)
which reduces to eq. (1) in the limit where q approaches 1. Should we be maximiz-
ing this measure of ignorance—while keeping certain average quantities fixed—in
order to arrive at a suitable ensemble? (Presumably we average using not the
probabilities Pr but their qth powers normalized so as to sum to unity—the so-
called Escort probabilities.) Do we, while maximizing I, keep a measure of total
information at its minimum value? Is a nonlinear term added to I + Y ? What
happens to the lower bound on I + Y ? Can we make appropriate changes in
the definition of AIC that will preserve or suitably generalize the relations we
discuss here? What happens to the approximate equality of I and the average
contingent AIC (given the ensemble)? What becomes of the four conditions in
eqs. (8a) to (8d)? What happens to the corresponding conditions (9a) to (9d)
for the case where we are coarse graining one probability distribution and thus
obtaining another one?
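As a purely numerical aside (added here; the example distribution and the Python rendering are arbitrary choices and do not answer the questions above), the sketch below evaluates I_q of eq. (11) for several values of q, confirms that it approaches the Shannon expression as q approaches 1 (up to the choice of logarithm base), and forms the escort probabilities mentioned above.

    import numpy as np

    # Arbitrary example distribution (illustrative only).
    P = np.array([0.5, 0.25, 0.125, 0.125])

    def I_q(P, q):
        # Generalized ignorance, eq. (11): I_q = -[sum_r P_r**q - 1] / (q - 1)
        return -(np.sum(P ** q) - 1.0) / (q - 1.0)

    def escort(P, q):
        # Escort probabilities: P_r**q, normalized so as to sum to unity
        w = P ** q
        return w / w.sum()

    # The q -> 1 limit of eq. (11) is -sum_r P_r ln P_r, i.e. eq. (1) up to
    # the choice of logarithm base.
    shannon_nats = -np.sum(P * np.log(P))

    for q in (0.5, 0.9, 0.999, 1.001, 1.5, 2.0):
        print("q =", q, " I_q =", round(float(I_q(P, q)), 4),
              " escort =", np.round(escort(P, q), 3))

    print("q -> 1 limit :", round(float(shannon_nats), 4), "nats")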
As is well known, a kind of entropy based on the generalized information or
ignorance of eq. (11) has been suggested [16] as the basis for a full-blown alter-
native, valid for certain situations, to the “thermostatistics” (thermodynamics
and statistical mechanics) of Boltzmann and Gibbs. (The latter is, of course,
founded on eq. (1) as the formula for information or ignorance.) Such a basic
interpretation of eq. (11) has been criticized by authors such as Luzzi et al. [13]
and Nauenberg [14]. We do not address those criticisms here, but should they
prove justified—in whole or in part—they need not rule out, at a practical level,
the applicability of eq. (11) to a variety of cases, such as systems of particles
attracted by 1/r² forces or systems at the so-called "edge of chaos."
ACKNOWLEDGMENTS
This research was supported by the National Science Foundation under the
Nanoscale Modeling and Simulation initiative. In addition, the work of Murray
Gell-Mann was supported by the C.O.U.Q. Foundation and by Insight Venture
Management. The generous help provided by these organizations is gratefully
acknowledged.
REFERENCES
[1] Adami, C., C. Ofria, and T. C. Collier. “Evolution of Biological Complex-
ity.” PNAS (USA) 97 (2000): 4463–4468.
[2] Bennett, C. H. “Dissipation, Information, Computational Complexity and
the Definition of Organization.” In Emerging Syntheses in Science, edited by
D. Pines, 215–234. Santa Fe Institute Studies in the Sciences of Complexity,
Proc. Vol. I. Redwood City: Addison-Wesley, 1987.
[3] Chaitin, G. J. Information, Randomness, and Incompleteness. Singapore:
World Scientific, 1987.
[4] Cover, T. M., and J. A. Thomas. Elements of Information Theory. New
York: Wiley, 1991.
[5] Crutchfield, J. P., and K. Young. “Inferring Statistical Complexity.” Phys.
Rev. Lett. 63 (1989): 105–108.
[6] Gell-Mann, M. The Quark and the Jaguar. New York: W. H. Freeman, 1994.
[7] Gell-Mann, M. “What is Complexity?” Complexity 1/1 (1995): 16–19.
[8] Gell-Mann, M., and S. Lloyd. "Information Measures, Effective Complexity,
and Total Information." Complexity 2/1 (1996): 44–52.
[9] Jaynes, E. T. Papers on Probability, Statistics and Statistical Physics, edited
by R. D. Rosenkrantz. Dordrecht: Reidel, 1982.
[10] Levin, L. A. Personal communication, 2000.
[11] Li, M., and P. M. B. Vitányi. An Introduction to Kolmogorov Complexity
and Its Applications. New York: Springer-Verlag, 1993.
[12] Lloyd, S., and H. Pagels. “Complexity as Thermodynamic Depth.” Ann.
Phys. 188 (1988): 186–213.
[13] Luzzi, R., A. R. Vasconcellos, and J. G. Ramos. “On the Question of the So-
Called Non-Extensive Thermodynamics.” IFGW-UNICAMP Internal Re-
port, Universidade Estadual de Campinas, Campinas, Sao Paulo, Brasil,
2002.
[14] Nauenberg, M. "A Critique of Nonextensive q-Entropy for Thermal Statistics."
Dec. 2002. lanl.gov e-Print Archive, Cornell University.
⟨http://eprints.lanl.gov/abs/cond-mat/0210561⟩.
[15] Schack, R. “Algorithmic Information and Simplicity in Statistical Physics.”
Intl. J. Theor. Phys. 36 (1997): 209–226.
[16] Tsallis, C. “Possible Generalization of Boltzmann-Gibbs Statistics.” J. Stat.
Phys. 52 (1988): 479–487.
[17] Wolpert, D. H., and W. G. Macready. "Self-Dissimilarity: An Empirically
Observable Measure of Complexity." In Unifying Themes in Complex Sys-
tems: Proceedings of the First NECSI International Conference, edited by
Y. Bar-Yam, 626–643. Cambridge: Perseus, 2002.