
Functional equations

Tom Leinster∗
Spring 2017

Preamble
Hello.
Admin: email addresses; sections in outline ≠ lectures; pace.
Overall plan: interested in unique characterizations of . . .
[Diagram: entropy, means and norms, all feeding into measures of (bio)diversity.]

There are many ways to measure diversity: long controversy.


Ideal: be able to say ‘If you want your diversity measure to have properties
X, Y and Z, then it must be one of the following measures.’
Similar results have been proved for entropy, means and norms.
This is a tiny part of the field of functional equations!
One ulterior motive: for me to learn something about FEs. I’m not an
expert, and towards end, this will get to edge of research (i.e. I’ll be making it
up as I go along).
Tools:
• native wit
• elementary real analysis
• (new!) some probabilistic methods.
One ref: Aczél and Daróczy, On Measures of Information and Their Characterizations. (Comments.) Other refs: will give as we go along.

1 Warm-up
Week I (7 Feb)
Which functions f satisfy f (x + y) = f (x) + f (y)? Which functions of two
variables can be separated as a product of functions of one variable?

This section is an intro to basic techniques. We may or may not need the
actual results we prove.
∗School of Mathematics, University of Edinburgh; [email protected]. Last edited on 5 April 2017.

The Cauchy functional equation
The Cauchy FE on a function f : R → R is

∀x, y ∈ R, f (x + y) = f (x) + f (y). (1)

There are some obvious solutions. Are they the only ones? Weak result first,
to illustrate technique.
Proposition 1.1 Let f : R → R be a differentiable function. TFAE (the following are equivalent):
i. f satisfies (1)
ii. there exists c ∈ R such that

∀x ∈ R, f (x) = cx.

If these conditions hold then c = f (1).

Proof (ii)⇒(i) and last part: obvious.


Now assume (i). Differentiate both sides of (1) with respect to x:

∀x, y ∈ R, f′(x + y) = f′(x).

Take x = 0: then f′(y) = f′(0) for all y ∈ R. So f′ is constant, so there exist c, d ∈ R such that

∀x ∈ R, f(x) = cx + d.

Substituting back into (1) gives d = 0, proving (ii). □

‘Differentiable’ is a much stronger condition than necessary!


Theorem 1.2 As for Proposition 1.1, but with ‘continuous’ in place of ‘differ-
entiable’.

Proof Let f be a continuous function satisfying (1).


• f (0 + 0) = f (0) + f (0), so f (0) = 0.
• f (x) + f (−x) = f (x + (−x)) = f (0) = 0, so f (−x) = −f (x). Cf. group
homomorphisms.
• Next, f (nx) = nf (x) for all x ∈ R and n ∈ Z. For n > 0, true by
induction. For n = 0, says f (0) = 0. For n < 0, have −n > 0 and so
f (nx) = −f (−nx) = −(−nf (x)) = nf (x).
• In particular, f (n) = nf (1) for all n ∈ Z.
• For m, n ∈ Z with n ≠ 0, we have

f (n · m/n) = f (m) = mf (1)

but also
f (n · m/n) = nf (m/n),
so f (m/n) = (m/n)f (1). Hence f (x) = f (1)x for all x ∈ Q.

• Now f and x ↦ f(1)x are continuous functions on R agreeing on Q, hence are equal. □

Remarks 1.3 i. ‘Continuous’ can be relaxed further still. It was pointed


out in class that continuity at 0 is enough. ‘Measurable’ is also enough
(Fréchet, ‘Pri la funkcia ekvacio f (x + y) = f (x) + f (y)’, 1913). Even
weaker: ‘bounded on some set of positive measure’. But never mind! For
this course, I’ll be content to assume continuity.
ii. To get a 'weird' solution of Cauchy FE (i.e. not of the form x ↦ cx), need existence of a non-measurable function. So, need some form of choice. So, can't really construct one.
iii. Assuming choice, weird solutions exist. Choose basis B for the vector space R over Q. Pick b_0 ≠ b_1 in B and a function φ : B → R such that φ(b_0) = 0 and φ(b_1) = 1. Extend to Q-linear map f : R → R. Then f(b_0) = 0 with b_0 ≠ 0, but f ≢ 0 since f(b_1) = 1. So f cannot be of the form x ↦ cx. But f satisfies the Cauchy functional equation, by linearity.

Variants (got by using the group isomorphism (R, +) ≅ ((0, ∞), ·) defined by exp and log):
Corollary 1.4 i. Let f : R → (0, ∞) be a continuous function. TFAE:
• f(x + y) = f(x)f(y) for all x, y
• there exists c ∈ R such that f(x) = e^{cx} for all x.
ii. Let f : (0, ∞) → R be a continuous function. TFAE:
• f(xy) = f(x) + f(y) for all x, y
• there exists c ∈ R such that f(x) = c log x for all x.
iii. Let f : (0, ∞) → (0, ∞) be a continuous function. TFAE:
• f(xy) = f(x)f(y) for all x, y
• there exists c ∈ R such that f(x) = x^c for all x.

Proof For (i), define g : R → R by g(x) = log f(x). Then g is continuous and satisfies Cauchy FE, so g(x) = cx for some constant c, and then f(x) = e^{cx}.
(ii) and (iii): similarly, putting g(x) = f(e^x) and g(x) = log f(e^x). □

Related:
Theorem 1.5 (Erdős?) Let f : Z+ → (0, ∞) be a function satisfying f(mn) = f(m)f(n) for all m, n ∈ Z+. (There are loads of solutions: can freely choose f(p) for every prime p. But . . . ) Suppose that either f(1) ≤ f(2) ≤ · · · or

lim_{n→∞} f(n + 1)/f(n) = 1.

Then there exists c ∈ R such that f(n) = n^c for all n.

Proof Omitted. 

Separation of variables
When can a function of two variables be written as a product/sum of two
functions of one variable? We’ll do sums, but can convert to products as in
Corollary 1.4.
Let X and Y be sets and
f: X ×Y →R
a function. Or can replace R by any abelian group. We seek functions
g : X → R, h: Y → R
such that
∀x ∈ X, y ∈ Y, f (x, y) = g(x) + h(y). (2)
Basic questions:
A Are there any pairs of functions (g, h) satisfying (2)?
B How can we construct all such pairs?
C How many such pairs are there? Clear that if there are any, there are many,
by adding/subtracting constants.

I got up to here in the first class, and was going to lecture the rest of this
section in the second class, but in the end decided not to. What I actually
lectured resumes at the start of Section 2. But for completeness, here’s the rest
of this section.

Attempt to recover g and h from f . Key insight:


f(x, y) − f(x′, y) = g(x) − g(x′)

(x, x′ ∈ X, y ∈ Y). No h's involved!
First lemma: g and h are determined by f , up to additive constant.
Lemma 1.6 Let g : X → R and h : Y → R be functions. Define f : X × Y → R
by (2). Let x0 ∈ X and y0 ∈ Y .
Then there exist c, d ∈ R such that c + d = f (x0 , y0 ) and
g(x) = f (x, y0 ) − c ∀x ∈ X, (3)
h(y) = f (x0 , y) − d ∀y ∈ Y. (4)
Proof Put y = y0 in (2): then
g(x) = f (x, y0 ) − c ∀x ∈ X
where c = h(y0 ). Similarly
h(y) = f (x0 , y) − d ∀y ∈ Y
where d = g(x0 ). Now
c + d = g(x0 ) + h(y0 ) = f (x0 , y0 )
by (2). 

But given f (and x0 and y0 ), is every pair (g, h) of this form a solution
of (2)? Not necessarily (but it’s easy to say when). . .

Lemma 1.7 Let f : X × Y → R be a function. Let x0 ∈ X, y0 ∈ Y , and


c, d ∈ R with c + d = f (x0 , y0 ). Define g : X → R by (3) and h : Y → R by (4).
If
f (x, y0 ) + f (x0 , y) = f (x, y) + f (x0 , y0 ) ∀x ∈ X, y ∈ Y
then
f (x, y) = g(x) + h(y) ∀x ∈ X, y ∈ Y.

Proof For all x ∈ X and y ∈ Y ,

g(x) + h(y) = f (x, y0 ) + f (x0 , y) − c − d = f (x, y0 ) + f (x0 , y) − f (x0 , y0 ),

etc. 

Can now answer the basic questions.


Existence of decompositions (A):

Proposition 1.8 Let f : X × Y → R. TFAE:


i. there exist g : X → R and h : Y → R such that

f (x, y) = g(x) + h(y) ∀x ∈ X, y ∈ Y

ii. f(x, y′) + f(x′, y) = f(x, y) + f(x′, y′) for all x, x′, y, y′.

Proof (i)⇒(ii): trivial.


(ii)⇒(i): trivial if X = ∅ or Y = ∅. Otherwise, choose x0 ∈ X and y0 ∈ Y ;
then use Lemma 1.7 with c = 0 and d = f (x0 , y0 ). 

Classification of decompositions (B):


Proposition 1.9 Let f : X × Y → R be a function satisfying the equivalent
conditions of Proposition 1.8, and let x0 ∈ X and y0 ∈ Y . Then a pair of
functions (g : X → R, h : Y → R) satisfies (2) if and only if there exist c, d ∈ R
satisfying c + d = f (x0 , y0 ), (3) and (4).

Proof Follows from Lemmas 1.6 and 1.7. 

Number of decompositions (C) (really: dim of solution-space):


Corollary 1.10 Let f : X × Y → R with X, Y nonempty. Either there are no
pairs (g, h) satisfying (2), or for any pair (g, h) satisfying (2), the set of all such
pairs is the 1-dimensional space

{(g + a, h − a) : a ∈ R}. 
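As a quick computational illustration (my own sketch, not part of the notes), here is how Proposition 1.8 and Lemmas 1.6–1.7 play out for functions on finite sets; the helper names separable and decompose are invented for this example.

    import itertools

    def separable(f, X, Y):
        # Condition (ii) of Proposition 1.8: f(x,y') + f(x',y) = f(x,y) + f(x',y')
        return all(abs(f(x, y2) + f(x2, y) - f(x, y) - f(x2, y2)) < 1e-9
                   for x, x2 in itertools.product(X, repeat=2)
                   for y, y2 in itertools.product(Y, repeat=2))

    def decompose(f, X, Y):
        # Return (g, h) with f(x,y) = g(x) + h(y), taking c = 0 and d = f(x0,y0) as in Lemma 1.7
        x0, y0 = X[0], Y[0]
        g = {x: f(x, y0) for x in X}              # g(x) = f(x, y0) - c with c = 0
        h = {y: f(x0, y) - f(x0, y0) for y in Y}  # h(y) = f(x0, y) - d with d = f(x0, y0)
        return g, h

    X, Y = [0, 1, 2], [0, 1]
    f = lambda x, y: x**2 + 3*y     # separable by construction
    assert separable(f, X, Y)
    g, h = decompose(f, X, Y)
    assert all(abs(g[x] + h[y] - f(x, y)) < 1e-9 for x in X for y in Y)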

2 Shannon entropy
Week II (14 Feb)
Recap, including Erdős theorem. No separation of variables!
The many meanings of the word entropy. Ordinary entropy, relative entropy,
conditional entropy, joint entropy, cross entropy; entropy on finite and infinite
spaces; quantum versions; entropy in topological dynamics; . . . . Today we stick
to the very simplest kind: Shannon entropy of a probability distribution on a
finite set.
Let p = (p_1, . . . , p_n) be a probability distribution on {1, . . . , n} (i.e. p_i ≥ 0, ∑ p_i = 1). The (Shannon) entropy of p is

H(p) = − ∑_{i: p_i>0} p_i log p_i = ∑_{i: p_i>0} p_i log(1/p_i).

The sum is over all i ∈ {1, . . . , n} such that p_i ≠ 0; equivalently, can sum over all i ∈ {1, . . . , n} but with the convention that 0 log 0 = 0.
Ways of thinking about entropy:
• Disorder.

• Uniformity. Will see that uniform distribution has greatest entropy among
all distributions on {1, . . . , n}.
• Expected surprise. Think of log(1/pi ) as your surprise at learning that an
event of probability pi has occurred. The smaller pi is, the more surprised
you are. Then H(p) is the expected value of the surprise: how surprised
you expect to be!
• Information. Similar to expected surprise. Think of log(1/pi ) as the infor-
mation that you gain by observing an event of probability pi . The smaller
pi is, the rarer the event is, so the more remarkable it is. Then H(p) is
the average amount of information per event.

• Lack of information (!). Dual viewpoints in information theory. E.g. if p


represents noise, high entropy means more noise. Won’t go into this.
• Genericity. In context of thermodynamics, entropy measures how generic
a state a system is in. Closely related to ‘lack of information’.

First properties:
• H(p) ≥ 0 for all p, with equality iff p = (0, . . . , 0, 1, 0, . . . , 0). Least
uniform distribution.
• H(p) ≤ log n for all p, with equality iff p = (1/n, . . . , 1/n). Most uniform distribution. Proof that H(p) ≤ log n uses concavity of log:

H(p) = ∑_{i: p_i>0} p_i log(1/p_i) ≤ log(∑_{i: p_i>0} p_i · (1/p_i)) ≤ log n.

• H(p) is continuous in p. (Uses limx→0+ x log x = 0.)
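A minimal computational sketch (mine, not from the notes) of the definition and the properties just listed:

    from math import log

    def H(p, base=2):
        # Shannon entropy, with the convention 0 log 0 = 0
        return sum(pi * log(1 / pi, base) for pi in p if pi > 0)

    p = [0.5, 0.25, 0.125, 0.125]
    print(H(p))                            # 1.75 bits
    assert 0 <= H(p) <= log(len(p), 2) + 1e-12
    print(H([1, 0, 0, 0]), H([1/4] * 4))   # 0.0 (point mass) and log2(4) = 2.0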

Remark 2.1 Base of logarithm usually taken to be e (for theory) or 2 (for
examples and in information theory/digital communication). Changing base of
logarithm scales H by constant factor—harmless!
Examples 2.2 Use log2 here.
i. Uniform distribution on 2^k elements:

H(1/2^k, . . . , 1/2^k) = log_2(2^k) = k.

Interpretation: knowing results of k fair coin tosses gives k bits of information.
ii.

H(1/2, 1/4, 1/8, 1/8) = (1/2) log_2 2 + (1/4) log_2 4 + (1/8) log_2 8 + (1/8) log_2 8
= (1/2)·1 + (1/4)·2 + (1/8)·3 + (1/8)·3
= 1¾.
Interpretation: consider a language with alphabet A, B, C, D, with fre-
quencies 1/2, 1/4, 1/8, 1/8. We want to send messages encoded in binary.
Compare Morse code: use short code sequences for common letters. The
most efficient unambiguous code encodes a letter of frequency 2^{−k} as a binary string of length k: e.g. here, could use

A: 0, B: 10, C: 110, D: 111.

Then code messages are unambiguous: e.g. 11010011110 can only be CBADB. Since k = log_2(1/2^{−k}), mean number of bits per letter is then ∑_i p_i log_2(1/p_i) = H(p) = 1¾.

iii. That example was special in that all the probabilities were integer powers
of 2. But. . . Can still make sense of this when probabilities aren’t
powers of 2 (Shannon’s first theorem). E.g. frequency distribution p =
(p1 , . . . , p26 ) of letters in English has H(p) ≈ 4, so can encode English in
about 4 bits/letter. So, it’s as if English had only 16 letters, used equally
often.
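A small sketch (mine) checking the arithmetic of Example 2.2(ii): the expected length of the prefix code A: 0, B: 10, C: 110, D: 111 under the frequencies (1/2, 1/4, 1/8, 1/8) equals H(p) = 1¾ bits.

    from math import log2

    freqs = {'A': 1/2, 'B': 1/4, 'C': 1/8, 'D': 1/8}
    code = {'A': '0', 'B': '10', 'C': '110', 'D': '111'}

    mean_length = sum(freqs[s] * len(code[s]) for s in freqs)
    entropy = sum(p * log2(1 / p) for p in freqs.values())
    print(mean_length, entropy)            # both 1.75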
Will now explain a more subtle property of entropy. Begin with example.
Example 2.3 Flip a coin. If it’s heads, roll a die. If it’s tails, draw from a pack
of cards. So final outcome is either a number between 1 and 6 or a card. There
are 6 + 52 = 58 possible final outcomes, with probabilities as shown (assuming
everything unbiased):

[Tree diagram: COIN splits 1/2–1/2 into DIE and CARDS; each die branch has probability 1/6 and each card branch 1/52, so the 58 final outcomes have probabilities 1/12 (each die outcome) and 1/104 (each card).]

How much information do you expect to get from observing the outcome?

• You know result of coin flip, giving H(1/2, 1/2) = 1 bit of info.
• With probability 1/2, you know result of die roll: H(1/6, . . . , 1/6) = log2 6
bits of info.
• With probability 1/2, you know result of card draw: H(1/52, . . . , 1/52) =
log2 52 bits.
In total:

1 + (1/2) log_2 6 + (1/2) log_2 52

bits of info. This suggests

H(1/12, . . . , 1/12, 1/104, . . . , 1/104) = 1 + (1/2) log_2 6 + (1/2) log_2 52

(with 6 copies of 1/12 and 52 copies of 1/104).

Can check true! Now formulate general rule.

The chain rule Write

∆n = {probability distributions on {1, . . . , n}}.

Geometrically, this is a simplex of dimension n − 1. Given

w ∈ ∆_n, p^1 ∈ ∆_{k_1}, . . . , p^n ∈ ∆_{k_n},

get composite distribution

w ∘ (p^1, . . . , p^n) = (w_1 p^1_1, . . . , w_1 p^1_{k_1}, . . . , w_n p^n_1, . . . , w_n p^n_{k_n}) ∈ ∆_{k_1+···+k_n}.

[Tree diagram: a root labelled w with n branches; the ith branch leads to a subtree labelled p^i.]

(For cognoscenti: this defines an operad structure on the simplices.)

Easy calculation¹ proves chain rule:

H(w ∘ (p^1, . . . , p^n)) = H(w) + ∑_{i=1}^n w_i H(p^i).

Special case: p^1 = · · · = p^n. For w ∈ ∆_n and p ∈ ∆_m, write

w ⊗ p = w ∘ (p, . . . , p) = (w_1 p_1, . . . , w_1 p_m, . . . , w_n p_1, . . . , w_n p_m) ∈ ∆_{nm}

(with n copies of p).
¹This is completely straightforward, but can be made even more transparent by first observing that the function f(x) = −x log x is a 'nonlinear derivation', i.e. f(xy) = xf(y) + f(x)y. In fact, −x log x is the only measurable function F with this property (up to a constant factor), since if we put g(x) = F(x)/x then g(xy) = g(y) + g(x) and so g(x) ∝ log x.

This is joint probability distribution if the two things are independent. Then
chain rule implies multiplicativity:

H(w ⊗ p) = H(w) + H(p).

Interpretation: information from two independent observations is sum of infor-


mation from each.
Where are the functional equations?
For each n ≥ 1, have function H : ∆n → R+ = [0, ∞). Faddeev2 showed:

Theorem 2.4 (Faddeev, 1956) Take functions (I : ∆_n → R_+)_{n≥1}. TFAE:

i. the functions I are continuous and satisfy the chain rule;


ii. I = cH for some c ∈ R+ .

That is: up to a constant factor, Shannon entropy is uniquely characterized


by continuity and chain rule.
Should we be disappointed to get scalar multiples of H, not H itself? No:
recall that different scalar multiples correspond to different choices of the base
for log.
Rest of this section: proof of Faddeev’s theorem.
Certainly (ii)⇒(i). Now take I satisfying (i).
Write u_n = (1/n, . . . , 1/n) ∈ ∆_n. Strategy: think about the sequence (I(u_n))_{n≥1}. It should be (c log n)_{n≥1} for some constant c.

Lemma 2.5 i. I(umn ) = I(um ) + I(un ) for all m, n ≥ 1.

ii. I(u1 ) = 0.

Proof For (i), umn = um ⊗ un , so

I(umn ) = I(um ⊗ un ) = I(um ) + I(un )

(by multiplicativity). For (ii), take m = n = 1 in (i). 

Theorem 1.5 (Erdős) would now tell us that I(un ) = c log n for some constant
c (putting f (n) = exp(I(un ))). But to conclude that, we need one of the two
alternative hypotheses of Theorem 1.5 to be satisfied. We prove the second one,
on limits. This takes some effort.
Lemma 2.6 I(1, 0) = 0.

Proof We compute I(1, 0, 0) in two ways. First,

I(1, 0, 0) = I((1, 0) ◦ ((1, 0), u1 )) = I(1, 0) + 1 · I(1, 0) + 0 · I(u1 ) = 2I(1, 0).

Second,

I(1, 0, 0) = I((1, 0) ◦ (u1 , u2 )) = I(1, 0) + 1 · I(u1 ) + 0 · I(u2 ) = I(1, 0)

since I(u1 ) = 0. Hence I(1, 0) = 0. 


2 Dmitry Faddeev, father of the physicist Ludvig Faddeev.

To use Erdős, need I(u_{n+1}) − I(u_n) → 0 as n → ∞. Can nearly prove that:

Lemma 2.7 I(u_{n+1}) − (n/(n+1)) I(u_n) → 0 as n → ∞.

Proof We have

u_{n+1} = (n/(n+1), 1/(n+1)) ∘ (u_n, u_1),

so by the chain rule and I(u_1) = 0,

I(u_{n+1}) = I(n/(n+1), 1/(n+1)) + (n/(n+1)) I(u_n).

So

I(u_{n+1}) − (n/(n+1)) I(u_n) = I(n/(n+1), 1/(n+1)) → I(1, 0) = 0

as n → ∞, by continuity and Lemma 2.6. □

To improve this to I(u_{n+1}) − I(u_n) → 0, use a general result that has nothing to do with entropy:

Lemma 2.8 Let (a_n)_{n≥1} be a sequence in R such that a_{n+1} − (n/(n+1)) a_n → 0 as n → ∞. Then a_{n+1} − a_n → 0 as n → ∞.

Proof Omitted; uses Cesàro convergence.3


Although I omitted this proof in class, I'll include it here. I'll follow the argument in Feinstein, The Foundations of Information Theory, around p.7.
Since a_{n+1} − a_n = (a_{n+1} − (n/(n+1)) a_n) − a_n/(n+1), it is enough to prove that a_n/(n+1) → 0 as n → ∞. Write b_1 = a_1 and b_n = a_n − ((n−1)/n) a_{n−1} for n ≥ 2. Then n a_n = n b_n + (n−1) a_{n−1} for all n ≥ 2, so

n a_n = n b_n + (n−1) b_{n−1} + · · · + 1 b_1

for all n ≥ 1. Dividing through by n(n+1) gives

a_n/(n+1) = (1/2) · mean(b_1, b_2, b_2, b_3, b_3, b_3, . . . , b_n, . . . , b_n)

(where b_k appears k times). Since b_n → 0 as n → ∞, the sequence

b_1, b_2, b_2, b_3, b_3, b_3, . . . , b_n, . . . , b_n, . . .

also converges to 0. Now a general result of Cesàro states that if a sequence (x_r) converges to ℓ then the sequence (x̄_r) also converges to ℓ, where x̄_r = (x_1 + · · · + x_r)/r. Applying this to the sequence above implies that

mean(b_1, b_2, b_2, b_3, b_3, b_3, . . . , b_n, . . . , b_n) → 0 as n → ∞.

Hence a_n/(n+1) → 0 as n → ∞, as required. □

We can now deduce what I(un ) is:


Lemma 2.9 There exists c ∈ R+ such that I(un ) = c log n for all n ≥ 1.
³Xīlíng Zhāng pointed out that this is also a consequence of Stolz's lemma, or as Wikipedia calls it, the Stolz–Cesàro theorem.

Proof We have I(u_{mn}) = I(u_m) + I(u_n), and by last two lemmas,

I(u_{n+1}) − I(u_n) → 0 as n → ∞.

So can apply Erdős's theorem (1.5) with f(n) = exp(I(u_n)) to get f(n) = n^c for some constant c ∈ R. So I(u_n) = c log n, and c ≥ 0 since I maps into R_+. □

We now know that I = cH on the uniform distributions un . It might seem


like we still have a mountain to climb to get to I = cH for all distributions.
But in fact, it’s easy.
Lemma 2.10 I(p) = cH(p) whenever p_1, . . . , p_n are rational.

Proof Write

p = (k_1/k, . . . , k_n/k)

where k_1, . . . , k_n ∈ Z and k = k_1 + · · · + k_n. Then

p ∘ (u_{k_1}, . . . , u_{k_n}) = u_k.

Since I satisfies the chain rule and I(u_r) = cH(u_r) for all r,

I(p) + ∑_{i=1}^n p_i · cH(u_{k_i}) = cH(u_k).

But since cH also satisfies the chain rule,

cH(p) + ∑_{i=1}^n p_i · cH(u_{k_i}) = cH(u_k),

giving the result. □

Theorem 2.4 follows by continuity.

Week III (21 Feb)

Recap of last time: ∆n , H, chain rule.


Information is a slippery concept to reason about. One day it will seem intu-
itively clear that the distribution (0.5, 0.5) is ‘more informative’ than (0.9, 0.1),
and the next day your intuition will say the opposite. So to make things con-
crete, it’s useful to concentrate on one particular framework: coding.
Slogan:
Entropy is average number of bits/symbol in an optimal encoding.
Consider language with alphabet A, B, . . ., used with frequencies p1 , p2 , . . . , pn .
Want to encode efficiently in binary.
• Revisit Example 2.2(ii).
• In general: if p_1, . . . , p_n are all powers of 2, there is an unambiguous encoding where the ith letter is encoded as a string of length log_2(1/p_i). So mean bits/symbol = ∑_i p_i log_2(1/p_i) = H(p).

• Shannon’s first theorem: for any p ∈ ∆n , in any unambiguous encoding,
mean bits/symbol ≥ H(p); moreover, if you’re clever, can do it in <
H(p) + ε for any ε > 0.

• When pi s aren’t powers of 2, do this by using blocks of symbols rather


than individual symbols.
Chain rule: how many bits/symbol on average to encode French, including accents?

[Tree diagram: letters a, b, c, . . . , z with frequencies 0.05, 0.02, 0.03, . . . , 0.004 and letter-entropy 3.8; below each letter, the distribution of its accents, e.g. (1/2, 1/4, 1/4) over a, à, â with entropy 1.5, (1) over b with entropy 0, (1/2, 1/2) over c, ç with entropy 1, (1) over z with entropy 0.]

We need

3.8 + (0.05 × 1.5 + 0.02 × 0 + 0.03 × 1 + · · · + 0.004 × 0)

bits/symbol: the first term is the bits for the actual letters, the bracketed sum the bits for the accents. (Convention: letters are a, b, c, . . . ; symbols are a, à, â, b, c, ç, . . . .) By the chain rule, this is equal to the entropy of the composite distribution

(0.05 × 1/2, 0.05 × 1/4, 0.05 × 1/4, 0.02 × 1, . . . , 0.004 × 1).



Relative entropy
Let p, r ∈ ∆_n. The entropy of p relative to r is

H(p ‖ r) = ∑_{i: p_i>0} p_i log(p_i/r_i).

Also called Kullback–Leibler divergence, relative information, or infor-


mation gain.
First properties:

• H(p ‖ r) ≥ 0. Not obvious, as log(p_i/r_i) is sometimes positive and sometimes negative. For, since log is concave,

H(p ‖ r) = − ∑_{i: p_i>0} p_i log(r_i/p_i) ≥ − log(∑_{i: p_i>0} p_i · r_i/p_i) ≥ − log 1 = 0.

• H(p k r) = 0 if and only if p = r. Evidence so far suggests relative entropy


is something like a distance. That’s wrong in that it’s not a metric, but
it’s not too terribly wrong. Will come back to this.

• H(p ‖ r) can be arbitrarily large (even for fixed n). E.g.

H((1/2, 1/2) ‖ (t, 1 − t)) → ∞ as t → 0,

and in fact H(p ‖ r) = ∞ if p_i > 0 = r_i for some i.


• Write

u_n = (1/n, . . . , 1/n) ∈ ∆_n.

Then

H(p ‖ u_n) = log n − H(p).

So entropy is pretty much a special case of relative entropy.
• H(p ‖ r) ≠ H(r ‖ p). E.g.

H(u_2 ‖ (0, 1)) = ∞,   H((0, 1) ‖ u_2) = log 2 − H((0, 1)) = log 2.

Will come back to this too.
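A short sketch (mine, not from the notes) of the definition and the properties just listed:

    from math import log2, inf

    def H(p):
        return sum(pi * log2(1 / pi) for pi in p if pi > 0)

    def rel(p, r):
        # H(p || r), infinite when p_i > 0 = r_i for some i
        return sum(pi * log2(pi / ri) if ri > 0 else inf
                   for pi, ri in zip(p, r) if pi > 0)

    p, r = [0.5, 0.5], [0.9, 0.1]
    print(rel(p, r), rel(r, p))                    # both positive, not equal
    u2 = [0.5, 0.5]
    assert abs(rel([0, 1], u2) - (log2(2) - H([0, 1]))) < 1e-12   # = log 2
    assert rel(u2, [0, 1]) == inf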

Coding interpretation Convenient fiction: for each 'language' p, there is an encoding for p using log(1/p_i) bits for the ith symbol, hence with mean bits/symbol = H(p) exactly. Call this 'machine p'.
We have

H(p ‖ r) = ∑ p_i log(1/r_i) − ∑ p_i log(1/p_i)
= (bits/symbol to encode language p using machine r)
− (bits/symbol to encode language p using machine p).

So relative entropy is the number of extra bits needed if you use the wrong machine. Or: penalty you pay for using the wrong machine. Explains why H(p ‖ r) ≥ 0 with equality if p = r.
If r_i = 0 then in machine r, the ith symbol has an infinitely long code word. Or if you like: if r_i = 2^{−1000} then its code word has length 1000. So if also p_i > 0 then for language p encoded using machine r, average bits/symbol = ∞. This explains why H(p ‖ r) = ∞.
Taking r = u_n,

H(p ‖ u_n) = log n − H(p) ≤ log n.

Explanation: in machine u_n, every symbol is encoded with log n bits, so the average extra bits/symbol caused by using machine u_n instead of machine p is ≤ log n.
Now a couple of slightly more esoteric comments, pointing in different math-
ematical directions (both away from functional equations). Tune out if you
want. . .

Measure-theoretic perspective Slogan:
All entropy is relative.

Attempt to generalize definition of entropy from probability measures on finite sets to arbitrary probability measures µ: want to say H(µ) = −∫ log(µ) dµ, but this makes no sense!
Note that the finite definition H(p) = −∑ p_i log p_i implicitly refers to counting measure. . .
However, can generalize relative entropy. Given measures µ and ν on measurable space X, define

H(µ ‖ ν) = ∫_X log(dµ/dν) dµ

where dµ/dν is the Radon–Nikodym derivative. This makes sense and is the right definition.
People do talk about the entropy of probability distributions on R^n. For instance, the entropy of a probability density function f on R is usually defined as H(f) = −∫_R f(x) log f(x) dx, and it's an important result that among all density functions on R with a given mean and variance, the one with the maximal entropy is the normal distribution. (This is related to the central limit theorem.) But here we're implicitly using Lebesgue measure λ on R; so there are two measures in play, λ and fλ, and H(f) = H(fλ ‖ λ).

Local behaviour of relative entropy Take two close-together distributions p, p + δ ∈ ∆_n. (So δ = (δ_1, . . . , δ_n) with ∑ δ_i = 0.) Taylor expansion gives

H(p + δ ‖ p) ≈ (1/2) ∑_i δ_i²/p_i

for δ small. Precisely:

H(p + δ ‖ p) = (1/2) ∑_i δ_i²/p_i + O(‖δ‖³) as δ → 0.

(Here ‖·‖ is any norm on R^n. It doesn't matter which, as they're all equivalent.) So:

Locally, H(− ‖ −) is like a squared distance.

In particular, locally (to second order) it's symmetric.
The square root of relative entropy is not a metric on ∆_n: not symmetric and fails triangle inequality. (E.g. put p = (0.9, 0.1), q = (0.2, 0.8), r = (0.1, 0.9). Then √H(p ‖ q) + √H(q ‖ r) < √H(p ‖ r).) But using it as a 'local distance'
leads to important things, e.g. Fisher information (statistics), the Jeffreys prior
(Bayesian statistics), and the whole subject of information geometry.
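A quick numerical sketch (mine) of the quadratic approximation above; the perturbation is arbitrary, chosen only so that its entries sum to 0:

    from math import log

    def rel(p, r):
        # relative entropy in nats
        return sum(pi * log(pi / ri) for pi, ri in zip(p, r) if pi > 0)

    p = [0.2, 0.3, 0.5]
    for eps in [1e-1, 1e-2, 1e-3]:
        d = [eps, -2 * eps, eps]                   # sums to 0
        exact = rel([pi + di for pi, di in zip(p, d)], p)
        approx = 0.5 * sum(di**2 / pi for pi, di in zip(p, d))
        print(eps, exact, approx)                  # ratio tends to 1 as eps -> 0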

Next week: a unique characterization of relative entropy.

Week IV (28 Feb)
Recap: entropy as mean bits/symbol in optimal encoding; definition of rel-
ative entropy; relative entropy as extra cost of using wrong machine.
Write
An = {(p, r) ∈ ∆n × ∆n : ri = 0 =⇒ pi = 0}.
So H(p k r) < ∞ ⇐⇒ (p, r) ∈ An . (Secretly, A stands for ‘absolutely
continuous’.) Then for each n ≥ 1, we have the function

H(− k −) : An → R+ .

Properties:
Measurability H(− k −) is measurable. If you don’t know what that means,
ignore: very mild condition. Every function that anyone has ever written
down a formula for, or ever will, is measurable.
Permutation-invariance E.g. H((p1 , p2 , p3 ) k (r1 , r2 , r3 )) = H((p2 , p3 , p1 ) k
(r2 , r3 , r1 )). It’s that kind of symmetry, not the kind where you swap p
and r.
Vanishing H(p k p) = 0 for all p. Remember: H(p k r) is extra cost of using
machine r (instead of machine p) to encode language p.
Chain rule For all

(w, w̃) ∈ A_n, (p^1, p̃^1) ∈ A_{k_1}, . . . , (p^n, p̃^n) ∈ A_{k_n},

we have

H(w ∘ (p^1, . . . , p^n) ‖ w̃ ∘ (p̃^1, . . . , p̃^n)) = H(w ‖ w̃) + ∑_{i=1}^n w_i H(p^i ‖ p̃^i).

To understand chain rule for relative entropy:


Example 2.11 Recall the letter/accent tree from last time.
Swiss French and Canadian French are written using the same alphabet and
accents, but with slightly different words, hence different frequencies of letters
and accents.
Consider Swiss French and Canadian French:

w ∈ ∆_26: frequencies of letters in Swiss
w̃ ∈ ∆_26: frequencies of letters in Canadian

and then

p^1 ∈ ∆_3: frequencies of accents on 'a' in Swiss
p̃^1 ∈ ∆_3: frequencies of accents on 'a' in Canadian
…
p^26 ∈ ∆_1: frequencies of accents on 'z' in Swiss
p̃^26 ∈ ∆_1: frequencies of accents on 'z' in Canadian.

So

w ∘ (p^1, . . . , p^26) = frequency distribution of all symbols in Swiss
w̃ ∘ (p̃^1, . . . , p̃^26) = frequency distribution of all symbols in Canadian

where 'symbol' means a letter with (or without) an accent.
Now encode Swiss using Canadian machine. How much extra does it cost (in mean bits/symbol) compared to encoding Swiss using Swiss machine?

mean extra cost per symbol = mean extra cost per letter + mean extra cost per accent.

And that's the chain rule. Coefficient in sum is w_i, not w̃_i, because it's Swiss we're encoding.
Clearly any scalar multiple of relative entropy also has these four properties
(measurability, permutation-invariance, vanishing, chain rule).

Theorem 2.12 Take functions (I(− ‖ −) : A_n → R_+)_{n≥1}. TFAE:
i. the functions I satisfy measurability, permutation-invariance, vanishing and the chain rule;
ii. I(− ‖ −) = cH(− ‖ −) for some c ∈ R_+.
It’s hard to believe that Theorem 2.12 is new; it could have been proved in
the 1950s. However, I haven’t found it in the literature. The proof that follows
is inspired by work of Baez and Fritz, who in turn built on and corrected work
of Petz. Cf. Baez and Fritz (arXiv:1402.3067) and Petz (cited by B&F).
Take I(− k −) satisfying (i). Define L : (0, 1] → R+ by
L(α) = I((1, 0) k (α, 1 − α)).
Note ((1, 0), (α, 1 − α)) ∈ A2 , so RHS is defined. If I = H then L(α) = − log α.
Lemma 2.13 Let (p, r) ∈ A_n with p_{k+1} = · · · = p_n = 0, where 1 ≤ k ≤ n. Then r_1 + · · · + r_k > 0, and

I(p ‖ r) = L(r_1 + · · · + r_k) + I(p′ ‖ r′)

where

p′ = (p_1, . . . , p_k),   r′ = (r_1, . . . , r_k)/(r_1 + · · · + r_k).

Proof Case k = n trivial; suppose k < n.
Since p is a probability distribution with p_i = 0 for all i > k, there is some i ≤ k such that p_i > 0, and then r_i > 0 since (p, r) ∈ A_n. Hence r_1 + · · · + r_k > 0.
By definition of operadic composition,

I(p ‖ r) = I((1, 0) ∘ (p′, r″) ‖ (r_1 + · · · + r_k, r_{k+1} + · · · + r_n) ∘ (r′, r″))

where r″ is the normalization of (r_{k+1}, . . . , r_n) if r_{k+1} + · · · + r_n > 0, or is chosen arbitrarily in ∆_{n−k} otherwise. (The set ∆_{n−k} is nonempty since k < n.) By chain rule, this is equal to

L(r_1 + · · · + r_k) + 1 · I(p′ ‖ r′) + 0 · I(r″ ‖ r″),

and result follows. In order to use the chain rule, we needed to know that various pairs were in A_2, A_k, etc.; that's easily checked. □
Lemma 2.14 L(αβ) = L(α) + L(β) for all α, β ∈ (0, 1].

Proof Consider

x := I((1, 0, 0) ‖ (αβ, α(1 − β), 1 − α)).

On one hand, Lemma 2.13 with k = 1 gives

x = L(αβ) + I((1) ‖ (1)) = L(αβ).

On other, Lemma 2.13 with k = 2 gives

x = L(α) + I((1, 0) ‖ (β, 1 − β)) = L(α) + L(β).

Result follows. □

Lemma 2.15 There is a unique constant c ∈ R+ such that L(α) = −c log α for
all α ∈ (0, 1].

Proof Follows from Lemma 2.14 and measurability, as in Cor 1.4. That corol-
lary was stated under the hypothesis of continuity, but measurability would have
been enough. 

Now we come to a clever part of Baez and Fritz’s argument.


Lemma 2.16 Let (p, r) ∈ A_n and suppose that p_i > 0 for all i. Then I(p ‖ r) = cH(p ‖ r).

Proof We have (p, r) ∈ A_n, so r_i > 0 for all i. So can choose α ∈ (0, 1] such that r_i − αp_i ≥ 0 for all i.
We will compute the (well-defined) number

x := I((p_1, . . . , p_n, 0, . . . , 0) ‖ (αp_1, . . . , αp_n, r_1 − αp_1, . . . , r_n − αp_n))

(with n zeros) in two ways. First, by Lemma 2.13 and the vanishing property,

x = L(α) + I(p ‖ p) = −c log α.

Second, by symmetry and then the chain rule,

x = I((p_1, 0, . . . , p_n, 0) ‖ (αp_1, r_1 − αp_1, . . . , αp_n, r_n − αp_n))
= I(p ∘ ((1, 0), . . . , (1, 0)) ‖ r ∘ ((α p_1/r_1, 1 − α p_1/r_1), . . . , (α p_n/r_n, 1 − α p_n/r_n)))
= I(p ‖ r) + ∑_{i=1}^n p_i L(α p_i/r_i)
= I(p ‖ r) − c log α − cH(p ‖ r).

Comparing the two expressions for x gives the result. □

Proof of Theorem 2.12 Let (p, r) ∈ A_n. By symmetry, can assume

p_1 > 0, . . . , p_k > 0, p_{k+1} = 0, . . . , p_n = 0

where 1 ≤ k ≤ n. Writing R = r_1 + · · · + r_k,

I(p ‖ r) = L(R) + I((p_1, . . . , p_k) ‖ (1/R)(r_1, . . . , r_k))   by Lemma 2.13
= −c log R + cH((p_1, . . . , p_k) ‖ (1/R)(r_1, . . . , r_k))   by Lemmas 2.15 and 2.16.

This holds for all I(− ‖ −) satisfying the four conditions; in particular, it holds for cH(− ‖ −). Hence I(p ‖ r) = cH(p ‖ r). □
Remark 2.17 Assuming permutation-invariance, the chain rule is equivalent to a very special case:

H((tw_1, (1 − t)w_1, w_2, . . . , w_n) ‖ (t̃w̃_1, (1 − t̃)w̃_1, w̃_2, . . . , w̃_n))
= H(w ‖ w̃) + w_1 H((t, 1 − t) ‖ (t̃, 1 − t̃))

for all (w, w̃) ∈ A_n and ((t, 1 − t), (t̃, 1 − t̃)) ∈ A_2. (Proof: induction.)

Next week: I’ll introduce a family of ‘deformations’ or ‘quantum versions’


of entropy and relative entropy. This turns out to be important if we want a
balanced perspective on what biodiversity is.

Week V (7 Mar)
Recap: formulas for entropy and cross entropy (in both log(1/something)
and − log(something) forms); chain rules for both.
Remark 2.18 I need to make explicit something we've been using implicitly for a while.
Let p ∈ ∆_n and r ∈ ∆_m. Then we get a new distribution

p ⊗ r = (p_1 r_1, . . . , p_1 r_m, . . . , p_n r_1, . . . , p_n r_m) = p ∘ (r, . . . , r) ∈ ∆_{nm}

(with n copies of r). Probabilistically, this is the joint distribution of independent random variables distributed according to p and r.
A special case of the chain rule for entropy:

H(p ⊗ r) = H(p) + H(r)

and similarly for relative entropy:

H(p ⊗ r ‖ p̃ ⊗ r̃) = H(p ‖ p̃) + H(r ‖ r̃).

(So H and H(− ‖ −) are log-like; like a higher version of Cauchy's functional equation.)

3 Deformed entropies
Shannon entropy is just a single member of a one-parameter family of entropies.
In fact, there are two different one-parameter families of entropies, both con-
taining Shannon entropy as a member. In some sense, these two families of
entropies are equivalent, but they have different flavours.
I’ll talk about both families: surprise entropies in this section, then Rényi
entropies when we come to measures of diversity later on.
Definition 3.1 Let q ∈ [0, ∞). The q-logarithm ln_q : (0, ∞) → R is defined by

ln_q(x) = (x^{1−q} − 1)/(1 − q) if q ≠ 1,
ln_q(x) = ln(x) if q = 1.

Notation: log vs. ln.
Then ln_q(x) is continuous in q (proof: l'Hôpital's rule). Warning:

ln_q(xy) ≠ ln_q(x) + ln_q(y),   ln_q(1/x) ≠ − ln_q(x).

First one inevitable: we already showed that scalar multiples of actual logarithm
are the only continuous functions that convert multiplication into addition. In
fact, there’s quite a neat formula for lnq (xy) in terms of lnq (x), lnq (y) and q:
exercise!
For a probability p ∈ [0, 1], can view

ln_q(1/p)

as one's 'surprise' at witnessing an event with probability p. (Decreasing in p; takes value 0 at p = 1.)

[Figure: graphs of ln_q(1/p) against p for q = 0, 1, 2, 3.]

Definition 3.2 Let q ∈ [0, ∞). The surprise entropy of order q of p ∈ ∆_n is

S_q(p) = ∑_{i: p_i>0} p_i ln_q(1/p_i) = (1/(1−q))(∑ p_i^q − 1) if q ≠ 1,
S_q(p) = ∑ p_i ln(1/p_i) if q = 1.

Interpretation: Sq (p) is the expected surprise of an event drawn from p.


Usually called ‘Tsallis entropies’, because Tsallis discovered them in physics
in 1988, after Havrda and Charvat discovered and developed them in information

theory in 1967, and after Vajda (1968), Daróczy (1970) and Sharma and Mittal
(1975) developed them further, and after Patil and Taillie used them as measures
of biodiversity in 1982.
Tsallis says the parameter q can be interpreted as the ‘degree of non-
extensivity’. When we come to diversity measures, I’ll give another interpreta-
tion.
Special cases:
• S_0(p) = |{i : p_i > 0}| − 1
• S_2(p) = 1 − ∑_i p_i² = probability that two random elements of {1, . . . , n} (chosen according to p) are different.
Properties of S_q (for fixed q):
• Permutation-invariance, e.g. S_q(p_2, p_3, p_1) = S_q(p_1, p_2, p_3).
• q-chain rule (numerical check in the sketch below):

S_q(w ∘ (p^1, . . . , p^n)) = S_q(w) + ∑_{i: w_i>0} w_i^q S_q(p^i).

• q-multiplicativity: as special case of chain rule,

S_q(w ⊗ p) = S_q(w) + (∑_i w_i^q) S_q(p).
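A sketch (mine, not from the notes) of ln_q, S_q and the q-chain rule, checked numerically for q = 2:

    from math import log

    def lnq(x, q):
        return log(x) if q == 1 else (x**(1 - q) - 1) / (1 - q)

    def Sq(p, q):
        return sum(pi * lnq(1 / pi, q) for pi in p if pi > 0)

    def compose(w, ps):
        return [wi * pij for wi, p in zip(w, ps) for pij in p]

    q = 2
    w = [0.3, 0.7]
    p1, p2 = [0.5, 0.5], [0.2, 0.3, 0.5]
    lhs = Sq(compose(w, [p1, p2]), q)
    rhs = Sq(w, q) + sum(wi**q * Sq(p, q) for wi, p in zip(w, [p1, p2]) if wi > 0)
    assert abs(lhs - rhs) < 1e-12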

Theorem 3.3 Let q ∈ [0, ∞) \ {1}. Let (I : ∆_n → R_+)_{n≥1} be functions. TFAE:
i. I is permutation-invariant and q-multiplicative in sense above;
ii. I = cS_q for some c ∈ R_+.
No regularity condition needed! And don’t need full chain rule—just a spe-
cial case, multiplicativity.
Following proof is extracted from Aczél and Daróczy’s book (Theorem 6.3.9),
but they make it look way more complicated. The key point is that the multi-
plicativity property is not symmetric.
Proof (ii)⇒(i) easy. Assume (i). By permutation-invariance,

I(p ⊗ r) = I(r ⊗ p)

for all p ∈ ∆_n and r ∈ ∆_m. So by q-multiplicativity,

I(p) + (∑ p_i^q) I(r) = I(r) + (∑ r_i^q) I(p),

hence

(1 − ∑ r_i^q) I(p) = (1 − ∑ p_i^q) I(r).

Now want to get the p's on one side and the r's on the other, to deduce that I(p) is proportional to 1 − ∑ p_i^q. But need to be careful about division by zero. Take r = u_2 = (1/2, 1/2): then

(1 − 2^{1−q}) I(p) = (1 − ∑ p_i^q) I(u_2)

for all p. But q ≠ 1, so 1 − 2^{1−q} ≠ 0. So I = cS_q where c = I(u_2)(1−q)/(2^{1−q} − 1). □

Relative entropy generalizes easily too, i.e. extends all along the family from
the point q = 1. Only ticklish point is lnq (1/x) vs. − lnq (x).
Definition 3.4 Let q ∈ [0, ∞). For p, r ∈ ∆_n, the relative surprise entropy of order q is

S_q(p ‖ r) = − ∑_{i: p_i>0} p_i ln_q(r_i/p_i) = (1/(1−q))(1 − ∑ p_i^q r_i^{1−q}) if q ≠ 1,
S_q(p ‖ r) = ∑ p_i log(p_i/r_i) if q = 1.

Properties:
• Permutation-invariance (as in case q = 1).
• q-chain rule:

S_q(w ∘ (p^1, . . . , p^n) ‖ w̃ ∘ (p̃^1, . . . , p̃^n)) = S_q(w ‖ w̃) + ∑_{i: w_i>0} w_i^q w̃_i^{1−q} S_q(p^i ‖ p̃^i).

• q-multiplicativity: as special case of chain rule (checked numerically in the sketch below),

S_q(w ⊗ p ‖ w̃ ⊗ p̃) = S_q(w ‖ w̃) + (∑_{i: w_i>0} w_i^q w̃_i^{1−q}) S_q(p ‖ p̃).
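And a corresponding sketch (mine) for the relative version, checking q-multiplicativity for q = 2:

    q = 2

    def Sq_rel(p, r, q):
        return (1 - sum(pi**q * ri**(1 - q) for pi, ri in zip(p, r) if pi > 0)) / (1 - q)

    def tensor(p, r):
        return [pi * rj for pi in p for rj in r]

    p, pt = [0.5, 0.5], [0.4, 0.6]
    r, rt = [0.2, 0.8], [0.3, 0.7]
    lhs = Sq_rel(tensor(p, r), tensor(pt, rt), q)
    rhs = Sq_rel(p, pt, q) + sum(pi**q * pti**(1 - q) for pi, pti in zip(p, pt)) * Sq_rel(r, rt, q)
    assert abs(lhs - rhs) < 1e-12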

Again, there’s a ludicrously simple characterization theorem that needs no


regularity condition. Nor does it need the vanishing condition of Theorem 2.12.
Recall notation:
An = {(p, r) ∈ ∆n × ∆n : ri = 0 =⇒ pi = 0}.
Then Sq (p k r) < ∞ ⇐⇒ (p, r) ∈ An .

Theorem 3.5 Let q ∈ [0, ∞) \ {1}. Let (I(− ‖ −) : A_n → R_+)_{n≥1} be functions. TFAE:
i. I(− ‖ −) is permutation-invariant and q-multiplicative in sense above;
ii. I(− ‖ −) = cS_q(− ‖ −) for some c ∈ R_+.

Proof (ii)⇒(i) easy. Assume (i). By permutation-invariance,

I(p ⊗ r ‖ p̃ ⊗ r̃) = I(r ⊗ p ‖ r̃ ⊗ p̃)

for all (p, p̃) ∈ A_n and (r, r̃) ∈ A_m. So by q-multiplicativity,

I(p ‖ p̃) + (∑ p_i^q p̃_i^{1−q}) I(r ‖ r̃) = I(r ‖ r̃) + (∑ r_i^q r̃_i^{1−q}) I(p ‖ p̃),

hence

(1 − ∑ r_i^q r̃_i^{1−q}) I(p ‖ p̃) = (1 − ∑ p_i^q p̃_i^{1−q}) I(r ‖ r̃).

Take r = (1, 0) and r̃ = u_2: then

(1 − 2^{q−1}) I(p ‖ p̃) = I((1, 0) ‖ u_2)(1 − ∑ p_i^q p̃_i^{1−q})

for all (p, p̃) ∈ A_n. But q ≠ 1, so 1 − 2^{q−1} ≠ 0. So I(− ‖ −) = cS_q(− ‖ −) where c = I((1, 0) ‖ u_2)(1−q)/(1 − 2^{q−1}). □
Next week: how to use probability theory to solve functional equations.

4 Probabilistic methods
Week VI (14 Mar)
How to use probability theory to solve functional equations.
References: Aubrun and Nechita, arXiv:1102.2618, S.R.S. Varadhan, Large
deviations.

Preview I want to make this broadly accessible, so I need to spend some time
explaining the background before I can show how to actually solve functional
equations using probability theory. We won’t reach the punchline until next
time. But to give the rough idea. . .
Functional equations have no stochastic element to them. So how could
probability theory possibly help to solve them?
Basic idea: use probability theory to replace complicated, exact formulas by
simple, approximate formulas.
Sometimes, an approximation is all you need.
E.g. (very simple example) multiply out the expression

(x + y)^1000 = (x + y)(x + y) · · · (x + y).

What terms do we obtain?


• Exact but complicated answer: every term is of the form x^i y^j with 0 ≤ i ≤ 1000 and i + j = 1000, and this term occurs 1000!/(i! j!) times.
• Simple but approximate answer: most terms are of the form x^i y^j where i and j are about 500. (Flip a fair coin 1000 times, and you'll usually get about 500 heads.)
Use probability theory to get descriptions like the second. (Different tools of probability theory allow us to give more or less specific meanings to 'mostly' and 'about', depending on how good an approximation we need.)
Aubrun and Nechita used this method to characterize the p-norms. Can also
use their method to characterize means, diversity measures, etc.
We’ll need two pieces of background: basic result on large deviations, basic
result on convex duality, then how they come together.

Cramér’s large deviation theorem


Let X_1, X_2, . . . be independent identically distributed (IID) real random variables. Write

X̄_r = (1/r)(X_1 + · · · + X_r)

(also a random variable). Fix x ∈ R, and consider behaviour of P(X̄_r ≥ x) for large r.
• Law of large numbers gives

P(X̄_r ≥ x) → 1 if x < E(X),   P(X̄_r ≥ x) → 0 if x > E(X),

where X is distributed identically to X_1, X_2, . . .. But could ask for more fine-grained information: how fast does P(X̄_r ≥ x) converge?

• Central limit theorem: X̄_r is roughly normal for each individual large r.
• Large deviation theory is orthogonal to CLT and tells us rate of convergence of P(X̄_r ≥ x) as r → ∞.
Rough result: there is a constant k(x) such that for large r,

P(X̄_r ≥ x) ≈ k(x)^r.

If x < E(X) then k(x) = 1. Focus on x > E(X); then k(x) < 1 and P(X̄_r ≥ x) decays exponentially with r.
Theorem 4.1 (Cramér) The limit

lim_{r→∞} P(X̄_r ≥ x)^{1/r}

exists and (we even have a formula!) is equal to

inf_{λ≥0} E(e^{λX})/e^{λx}.
This is a standard result. Nice short proof: Cerf and Petit, arXiv:1002.3496.
C&P state it without any kind of hypothesis on finiteness of moments; is that
correct?
Remarks 4.2 i. E(e^{λX}) is by definition the moment generating function (MGF) m_X(λ) of X. For all the standard distributions, the MGF is known and easy to look up, so this inf on the RHS is easy to compute. E.g. dead easy for normal distribution.
ii. The limit is equal to 1 if x ≤ E(X), and to inf_{λ∈R} E(e^{λX})/e^{λx} if x ≥ E(X). Not too hard to deduce.


Example 4.3 Let c_1, . . . , c_n ∈ R. Let X, X_1, X_2, . . . take values c_1, . . . , c_n with probability 1/n each. (If some c_i's are same, increase probabilities accordingly: e.g. for 7, 7, 8, our random variables take value 7 with probability 2/3 and 8 with probability 1/3.) Then:
• For x ∈ R,

P(X̄_r ≥ x) = (1/n^r) |{(i_1, . . . , i_r) : c_{i_1} + · · · + c_{i_r} ≥ rx}|.

• For λ ∈ R,

E(e^{λX}) = (1/n)(e^{c_1 λ} + · · · + e^{c_n λ}).

• So by Cramér, for x ∈ R,

lim_{r→∞} |{(i_1, . . . , i_r) : c_{i_1} + · · · + c_{i_r} ≥ rx}|^{1/r} = inf_{λ≥0} (e^{c_1 λ} + · · · + e^{c_n λ})/e^{xλ}.

So, we've used probability theory to say something nontrivial about a completely deterministic situation.
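A numerical illustration (my own sketch): take c = (0, 1) in Example 4.3, so the count is over 0/1 strings of length r with at least rx ones, and compare the rth root of the count with the infimum on the right-hand side. The choice x = 0.8 and the grid for λ are arbitrary.

    from math import comb, exp, ceil

    c = [0, 1]          # values taken by X, each with probability 1/2
    x = 0.8             # some x > E(X) = 1/2

    def count_root(r):
        # |{(i_1,...,i_r) : c_{i_1} + ... + c_{i_r} >= r x}|^(1/r)
        count = sum(comb(r, k) for k in range(ceil(r * x), r + 1))
        return count ** (1 / r)

    def cramer_inf():
        # inf over lambda >= 0 of (e^{0*l} + e^{1*l}) / e^{x l}, on a finite grid
        return min((1 + exp(l)) / exp(x * l) for l in (i / 1000 for i in range(20000)))

    for r in [10, 100, 1000]:
        print(r, count_root(r))   # approaches the limit from below
    print(cramer_inf())           # about 1.649 for x = 0.8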
To get further, need a second tool: convex duality.

Convex duality
Let f : R → [−∞, ∞] be a function. Its convex conjugate or Legendre–Fenchel transform is the function f* : R → [−∞, ∞] defined by

f*(λ) = sup_{x∈R} (λx − f(x)).

Can show f ∗ is convex, hence the name.


If f is differentiable, then f* describes the y-intercept of a tangent line as a function of its gradient.

[Figure: graph of f(x) with a tangent line of gradient λ; the tangent meets the y-axis at (0, −f*(λ)).]

In proper generality, f is a function on a real vector space and f ∗ is a function


on its dual space. In even more proper generality, there’s a nice paper by Simon
Willerton exhibiting this transform as a special case of a general categorical
construction (arXiv:1501.03791).
Example 4.4 If f, g : R → R are differentiable convex functions with f′ = (g′)^{−1} then g = f* and f = g*. (E.g. true if f(x) = x^p/p and g(x) = x^q/q with 1/p + 1/q = 1, 'conjugate exponents'.)

Theorem 4.5 (Fenchel–Moreau) If f : R → R is convex then f ∗∗ = f .

Since conjugate of anything is convex, can only possibly have f ∗∗ = f if f


is convex. So this is best possible result.
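A crude numerical sketch (mine) of the conjugate and of Fenchel–Moreau, with the suprema taken over a finite grid; f(x) = x²/2 is used because its conjugate is known to be λ²/2:

    grid = [i / 20 for i in range(-100, 101)]        # crude grid on [-5, 5]

    def conjugate(f):
        return lambda l: max(l * x - f(x) for x in grid)

    f = lambda x: x * x / 2
    fstar = conjugate(f)
    fstarstar = conjugate(fstar)
    for x in [-2, -1, 0, 1.5, 3]:
        print(x, f(x), fstar(x), fstarstar(x))       # f* = f and f** = f, approximately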

Back to large deviations


Cramér’s theorem secretly involves a convex conjugate. Let’s see how.
Take IID real random variables X, X_1, X_2, . . . as before. For x ≥ E(X), Cramér says

lim_{r→∞} P(X̄_r ≥ x)^{1/r} = inf_{λ∈R} m_X(λ)/e^{λx}

where m_X(λ) = E(e^{λX}), i.e. (taking logs)

lim_{r→∞} (1/r) log P(X̄_r ≥ x) = inf_{λ∈R} (log m_X(λ) − λx)
= − sup_{λ∈R} (λx − log m_X(λ))
= −(log m_X)*(x)

or equivalently

(log m_X)*(x) = lim_{r→∞} −(1/r) log P(X̄_r ≥ x).

Now some hand-waving. Ignoring fact that this only holds for x ≥ E(X), take conjugate of each side. Then for all λ ≥ 0 (a restriction we need to make the hand-waving work),

(log m_X)**(λ) = sup_{x∈R} (λx + lim_{r→∞} (1/r) log P(X̄_r ≥ x)).

It's a general fact that log m_X (called the cumulant generating function) is convex. Hence (log m_X)** = log m_X by Fenchel–Moreau. So (taking exponentials)

m_X(λ) = sup_{x∈R} lim_{r→∞} e^{λx} P(X̄_r ≥ x)^{1/r}.

This really is true. It's a general formula for the moment generating function.
To make the hand-waving respectable: the full formula is

(log m_X)*(x) = lim_{r→∞} −(1/r) log P(X̄_r ≥ x) if x ≥ E(X),
(log m_X)*(x) = lim_{r→∞} −(1/r) log P(X̄_r ≤ x) if x ≤ E(X).

Can prove this by applying Cramér to −X and −x. Then can use it to get the
formula above for (log mX )∗∗ (λ) when λ ≥ 0.
We’ll use a variant:
Theorem 4.6 (Dual Cramér) For any IID X, X_1, X_2, . . ., for all λ ≥ 0,

m_X(λ) = sup_{x∈R} sup_{r≥1} e^{λx} P(X̄_r ≥ x)^{1/r}.
Proof See Cerf and Petit, who use it to prove Cramér. 

This is the result we’ll use (not Cramér’s theorem itself). Now let’s apply
it.
Example 4.7 As before, let c_1, . . . , c_n ∈ R and consider uniform distribution on c_1, . . . , c_n. Dual Cramér gives

e^{c_1 λ} + · · · + e^{c_n λ} = sup_{x∈R, r≥1} e^{λx} |{(i_1, . . . , i_r) : c_{i_1} + · · · + c_{i_r} ≥ rx}|^{1/r}

for all λ ≥ 0.

Remember the background to all this: Aubrun and Nechita used large de-
viation theory to prove a unique characterization of the p-norms. Since the
left-hand side here is a sum of powers (think λ = p), we can now begin to see
the connection.

Next time: we’ll exploit this expression for sums of powers. We’ll use it
to prove a theorem that pins down what’s so special about the p-norms, and
another theorem saying what’s so special about power means.

Week VII (21 Mar)
Last time: Cramér’s theorem and its convex dual. Crucial result: Exam-
ple 4.7.
Today: we’ll use this to give a unique characterization of the p-norms.
I like theorems arising from ‘mathematical anthropology’. You observe some
group of mathematicians and notice that they seem to place great importance
on some particular object: for instance, algebraic topologists are always talking
about simplicial sets, representation theorists place great emphasis on charac-
ters, certain kinds of analyst make common use of the Fourier transform, and
other analysts often talk about the p-norms.
Then you can ask: why do they attach such importance to that object, not
something slightly different? Is it the only object that enjoys the properties it
enjoys? If not, why do they use the object they use and not some other object
enjoying those properties? (Are they missing a trick?) And if it is the only
object with those properties, we should be able to prove it!
We’ll do this now for the p-norms, which are very standard in analysis.
Following theorem and proof are from Aubrun and Nechita, arXiv:1102.2618.
I believe they were the pioneers of this probabilistic method for solving func-
tional equations. See also Tricki, ‘The tensor power trick’.
Notation: for a set I (which will always be finite for us), write

RI = {functions I → R} = {families (xi )i∈I of reals}.

E.g. R{1,...,n} = Rn . In some sense we might as well only use Rn , because every
RI is isomorphic to Rn . But the notation will be smoother if we allow arbitrary
finite sets.
Definition 4.8 Let I be a set. A norm on R^I is a function R^I → R_+, written x ↦ ‖x‖, satisfying:
i. ‖x‖ = 0 ⟺ x = 0;
ii. ‖cx‖ = |c| ‖x‖ for all c ∈ R, x ∈ R^I;
iii. ‖x + y‖ ≤ ‖x‖ + ‖y‖.

There are many norms, but the following family comes up more often than
any other.
Example 4.9 Let I be any finite set and p ∈ [1, ∞]. The p-norm ‖·‖_p on R^I is defined for x = (x_i)_{i∈I} ∈ R^I by

‖x‖_p = (∑_{i∈I} |x_i|^p)^{1/p} if p < ∞,
‖x‖_p = max_{i∈I} |x_i| if p = ∞.

Then ‖x‖_∞ = lim_{p→∞} ‖x‖_p.

Aubrun and Nechita prove and use following result, which they don’t quite
make explicit:
Proposition 4.10 For p ∈ [1, ∞) and x ∈ (0, ∞)^n,

‖x‖_p = sup_{u>0, r≥1} u · |{(i_1, . . . , i_r) : x_{i_1} · · · x_{i_r} ≥ u^r}|^{1/(rp)}.

Proof In Example 4.7, put c_i = log x_i and λ = p. Also substitute u = e^x, noting that the letter x now means something else. Then

x_1^p + · · · + x_n^p = sup_{u>0, r≥1} u^p · |{(i_1, . . . , i_r) : x_{i_1} · · · x_{i_r} ≥ u^r}|^{1/r}.

Then take pth root of each side. □
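A sketch (mine) comparing a finite truncation of this supremum with the usual formula for the p-norm; the grid for u and the bound on r are arbitrary, and the truncated supremum only approaches ‖x‖_p from below:

    from itertools import product
    from math import prod

    p = 3
    x = [0.5, 1.0, 2.0]
    norm_p = sum(xi ** p for xi in x) ** (1 / p)

    best = 0.0
    for r in range(1, 8):
        prods = [prod(t) for t in product(x, repeat=r)]
        for u in [i / 50 for i in range(1, 151)]:    # u in (0, 3]
            N = sum(1 for P in prods if P >= u ** r)
            if N > 0:
                best = max(best, u * N ** (1 / (r * p)))

    print(norm_p, best)    # best is a lower bound, improving as the truncation grows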


Fix p ∈ [1, ∞]. For each finite set I, have the p-norm on R^I. These enjoy some special properties. Special properties of the p-norms:
• We have

‖(x_3, x_2, x_1)‖_p = ‖(x_1, x_2, x_3)‖_p   (5)

(etc.) and

‖(x_1, x_2, x_3, 0)‖_p = ‖(x_1, x_2, x_3)‖_p   (6)

(etc). Generally, any injection f : I → J induces a map f_* : R^I → R^J, defined by relabelling according to f and padding out with zeros, i.e.

(f_* x)_j = x_i if j = f(i) for some i ∈ I, and 0 otherwise

(x ∈ R^I, j ∈ J). Then

‖f_* x‖_p = ‖x‖_p   (7)

for all injections f : I → J and x ∈ R^I. For permutations f of {1, . . . , n}, (7) gives equations such as (5); for the inclusion {1, . . . , n} ↪ {1, . . . , n, n+1}, (7) gives equations such as (6).
• For A, B, x, y, z ∈ R, we have

‖(Ax, Ay, Az, Bx, By, Bz)‖_p = ‖(A, B)‖_p ‖(x, y, z)‖_p

(etc). Generally, for x = (x_i)_{i∈I} ∈ R^I and y ∈ R^J, define

x ⊗ y = (x_i y_j)_{i∈I, j∈J} ∈ R^{I×J}.

(If you identify R^I ⊗ R^J with R^{I×J}, as you can for finite sets, then x ⊗ y means what you think.) Then

‖x ⊗ y‖_p = ‖x‖_p ‖y‖_p   (8)

for all finite sets I and J, x ∈ R^I, and y ∈ R^J.
That's all!
That’s all!
Definition 4.11 A system of norms consists of a norm ‖·‖ on R^I for each finite set I, satisfying (7). This just guarantees that the norms on R^I, for different sets I, hang together nicely. It is multiplicative if (8) holds. That's the crucial property of the p-norms.

Example 4.12 For each p ∈ [1, ∞], the p-norm ‖·‖_p is a multiplicative system of norms.

Theorem 4.13 (Aubrun and Nechita) Every multiplicative system of norms is equal to ‖·‖_p for some p ∈ [1, ∞].

Rest of today: prove this.
Let ‖·‖ be a multiplicative system of norms.

Step 1: elementary results
Lemma 4.14 Let I be a finite set and x, y ∈ R^I.
i. If y_i = ±x_i for each i ∈ I then ‖x‖ = ‖y‖.
ii. If y_i = |x_i| for each i ∈ I then ‖x‖ = ‖y‖.
iii. If 0 ≤ x_i ≤ y_i for each i ∈ I then ‖x‖ ≤ ‖y‖.

Proof Omitted in class, but here it is.
i. x ⊗ (1, −1) is a permutation of y ⊗ (1, −1), so

‖x ⊗ (1, −1)‖ = ‖y ⊗ (1, −1)‖,

or equivalently

‖x‖ ‖(1, −1)‖ = ‖y‖ ‖(1, −1)‖,

and so ‖x‖ = ‖y‖.
ii. Immediate from last part.
iii. For each a ∈ {1, −1}^I, have vector (a_i y_i)_{i∈I} ∈ R^I. Let S be the set of such vectors. So if I has n elements then S has 2^n elements. Then the convex hull of S is ∏_{i∈I} [−y_i, y_i], which contains x:

[Figure: for I = {1, 2}, the rectangle with vertices (±y_1, ±y_2) contains x = (x_1, x_2).]

But every vector in S has norm ‖y‖ (by first part), so ‖x‖ ≤ ‖y‖ by triangle inequality. □

Step 2: finding p Write

1_n = (1, . . . , 1) ∈ R^n.

(Thought: ‖1_n‖_p = n^{1/p}.) Then

‖1_{mn}‖ = ‖1_m ⊗ 1_n‖ = ‖1_m‖ ‖1_n‖

by multiplicativity, and

‖1_n‖ = ‖(1, . . . , 1, 0)‖ ≤ ‖1_{n+1}‖

by Lemma 4.14(iii), so by Theorem 1.5, there exists c ≥ 0 such that ‖1_n‖ = n^c for all n ≥ 1. By triangle inequality, ‖1_{m+n}‖ ≤ ‖1_m‖ + ‖1_n‖, which implies c ∈ [0, 1]. Put p = 1/c ∈ [1, ∞]. Then ‖1_n‖ = n^{1/p} = ‖1_n‖_p.

Step 3: the case p = ∞ If p = ∞, easy to show ‖·‖ = ‖·‖_∞. (Omitted in class, but here it is.) Let p = ∞, that is, c = 0. By Lemma 4.14, enough to prove ‖x‖ = ‖x‖_∞ for all x ∈ R^n such that x_i ≥ 0 for all i. Choose j such that x_j = ‖x‖_∞. Using Lemma 4.14(iii),

‖x‖ ≤ ‖(x_j, . . . , x_j)‖ = x_j ‖1_n‖ = x_j = ‖x‖_∞.

On the other hand, using Lemma 4.14(iii) again,

‖x‖ ≥ ‖(0, . . . , 0, x_j, 0, . . . , 0)‖ = ‖(x_j)‖ = x_j ‖1_1‖ = x_j = ‖x‖_∞.

So ‖x‖ = ‖x‖_∞, as required. Now assume p < ∞.

Step 4: exploiting Cramér By Lemma 4.14(ii), enough to prove ‖x‖ = ‖x‖_p for each x ∈ R^n such that x_i > 0 for all i. (Assume this from now on.) For w ∈ R^J and t ∈ R, write

N(w, t) = |{j ∈ J : w_j ≥ t}|.

For r ≥ 1, write

x^{⊗r} = x ⊗ · · · ⊗ x ∈ R^{n^r}.

Then Proposition 4.10 states that

‖x‖_p = sup_{u>0, r≥1} u · N(x^{⊗r}, u^r)^{1/(rp)}

or equivalently

‖x‖_p = sup_{u>0, r≥1} ‖(u^r, . . . , u^r)‖^{1/r}   (9)

where the vector (u^r, . . . , u^r) has N(x^{⊗r}, u^r) entries.

Will use this to show ‖x‖ ≥ ‖x‖_p, then ‖x‖ ≤ ‖x‖_p.

Step 5: the lower bound First show ‖x‖ ≥ ‖x‖_p. By (9), enough to show that for each u > 0 and r ≥ 1,

‖x‖ ≥ ‖(u^r, . . . , u^r)‖^{1/r}

(with N(x^{⊗r}, u^r) entries). By multiplicativity, this is equivalent to

‖x^{⊗r}‖ ≥ ‖(u^r, . . . , u^r)‖.

But this is clear, since by Lemma 4.14(iii),

‖x^{⊗r}‖ ≥ ‖(u^r, . . . , u^r, 0, . . . , 0)‖ = ‖(u^r, . . . , u^r)‖

(the left-hand vector having n^r entries, of which N(x^{⊗r}, u^r) are equal to u^r).

Step 6: the upper bound Now prove ‖x‖ ≤ θ‖x‖_p for each θ > 1. Since min_i x_i > 0, can choose u_0, . . . , u_d such that

min_i x_i = u_0 < u_1 < · · · < u_d = max_i x_i

and u_k/u_{k−1} < θ for all k ∈ {1, . . . , d}.
For each r ≥ 1, define y_r ∈ R^{n^r} to be x^{⊗r} with each coordinate rounded up to the next one in the set {u_1^r, . . . , u_d^r}. Then x^{⊗r} ≤ y_r coordinatewise, giving

‖x^{⊗r}‖ ≤ ‖y_r‖ = ‖(u_1^r, . . . , u_1^r, . . . , u_d^r, . . . , u_d^r)‖

where the number of terms u_k^r is ≤ N(x^{⊗r}, u_{k−1}^r). So

‖x^{⊗r}‖ ≤ ∑_{k=1}^d ‖(u_k^r, . . . , u_k^r)‖   (≤ N(x^{⊗r}, u_{k−1}^r) terms in each)
≤ d max_{1≤k≤d} ‖(u_k^r, . . . , u_k^r)‖
≤ d θ^r max_{1≤k≤d} ‖(u_{k−1}^r, . . . , u_{k−1}^r)‖
≤ d θ^r ‖x‖_p^r

by (9). Hence ‖x‖ = ‖x^{⊗r}‖^{1/r} ≤ d^{1/r} θ ‖x‖_p. Letting r → ∞, get ‖x‖ ≤ θ‖x‖_p. True for all θ > 1, so ‖x‖ ≤ ‖x‖_p, completing the proof of Theorem 4.13.

Next week: how to measure biological diversity, and how diversity measures
are related to norms, means and entropy.

5 The diversity of a biological community


Week VIII (28 Mar)
Recap: ∆n , ⊗, H, Sq .
Intro: 50 years of debate; conceptual problem, not just practical or statisti-
cal.

Diversity based on relative abundance


Model a biological community as a finite probability distribution p =
(p1 , . . . , pn ), where n denotes number of species and pi denotes relative abun-
dance (frequency) of ith species.

Basic problem What should we mean by the ‘diversity’ of a community?

E.g. consider two communities, A and B. Which is more diverse?

[Figure: two communities; A has more species, B has better balance of abundances.]

Definition 5.1 Let q ∈ [0, ∞]. The Hill number (or diversity) of order q of p is defined for q ≠ 1, ∞ by

D_q(p) = (∑_{i: p_i>0} p_i^q)^{1/(1−q)}

and for q = 1, ∞ by taking limits, which gives

D_1(p) = 1/(p_1^{p_1} p_2^{p_2} · · · p_n^{p_n}),   D_∞(p) = 1/max_i p_i.

Special cases:
• D0 (p) is species richness, i.e. number of species present. Most common
meaning of ‘diversity’ in both general media and ecological literature.
• D1 (p) = exp(H1 (p)). Doesn’t depend on choice of base for logarithm.

• D_2(p) = 1/∑ p_i². Keep choosing pairs of individuals from community (with replacement). Then D_2(p) is expected number of trials before you find two of same species.
• D∞ (p) depends only on most common species. Low if there’s one highly
dominant species.

The parameter q controls emphasis on rare species (low q: more emphasis).


The diversity profile of the community is the graph of D_q(p) against q. E.g. for communities A and B above:

[Figure: diversity profiles of communities A and B, i.e. graphs of D_q(p) against q.]
Always decreasing and continuous, for reasons we’ll come back to.
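A small sketch (mine) of the Hill numbers, including the limiting cases, evaluated on two toy communities (the abundance vectors are made up for illustration):

    from math import prod, inf

    def D(p, q):
        p = [pi for pi in p if pi > 0]
        if q == 1:
            return 1 / prod(pi ** pi for pi in p)
        if q == inf:
            return 1 / max(p)
        return sum(pi ** q for pi in p) ** (1 / (1 - q))

    A = [1/8] * 8                          # 8 equally abundant species
    B = [0.4, 0.3, 0.2, 0.1, 0.0]          # 4 species present, unbalanced
    for q in [0, 0.5, 1, 2, inf]:
        print(q, D(A, q), D(B, q))         # profile of A is constant at 8; B's decreases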
Properties of Hill numbers: let q ∈ [0, ∞].
• Dq is an effective number: writing un = (1/n, . . . , 1/n),

Dq (un ) = n.

So if Dq (p) = 17.2, interpret as: our community is slightly more diverse


than a community of 17 equally abundant species.
Reason to use Dq and not log Dq (e.g. D1 rather than Shannon entropy
H). Following an argument of Lou Jost: imagine a continent with a
million equally common species. A plague strikes, entirely killing 99% of
the species and leaving the remaining 1% unchanged. For any of the Dq s,
the diversity drops by 99%, as it should. But the Shannon entropy drops
by only 33%, which is completely misleading!
• Dq (p) has maximum value n, achieved when p = un , and minimum value
1, achieved when p = (0, . . . , 0, 1, 0, . . . , 0).
• Replication principle: take two islands of same size, each with distri-
bution p, but with completely disjoint species. Then diversity of whole is
2Dq (p).

Means To put this in context, helpful to think about means.

Definition 5.2 Let p ∈ ∆_n and x ∈ (R_+)^n. Let t ∈ [−∞, ∞]. The power mean of x, weighted by p, of order t, is given by

M_t(p, x) = (∑_{i: p_i>0} p_i x_i^t)^{1/t}

(t ≠ 0, ±∞) and in remaining cases by taking limits:

M_{−∞}(p, x) = min_{i: p_i>0} x_i,   M_0(p, x) = ∏_{i: p_i>0} x_i^{p_i},   M_∞(p, x) = max_{i: p_i>0} x_i.

Proposition 5.3 Let p ∈ ∆n and x ∈ (R+ )n . Then Mt (p, x) is (non-strictly)
increasing and continuous in t ∈ [−∞, ∞].

Proof See e.g. Hardy, Littlewood and Pólya’s book Inequalities, Theorem 16.

E.g. M0 (un , x) ≤ M1 (un , x): inequality of arithmetic and geometric means.


Remark 5.4 There are many unique characterization theorems for means (omitted!). E.g. can use fact that for t ≥ 1,

M_t(u_n, x) = n^{−1/t} ‖x‖_t

to turn Aubrun and Nechita's characterization of the p-norms into a characterization of the power means.

Write 1/p = (1/p1 , . . . , 1/pn ). Then

Dq (p) = M1−q (p, 1/p).

Interpretation: 1/pi measures rarity of ith species. So Dq (p) is average rarity.


Result above implies diversity profiles are decreasing and continuous.
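A sketch (mine) of the power means, checking monotonicity in t and the identity D_q(p) = M_{1−q}(p, 1/p) for a made-up distribution:

    from math import prod

    def M(p, x, t):
        pairs = [(pi, xi) for pi, xi in zip(p, x) if pi > 0]
        if t == 0:
            return prod(xi ** pi for pi, xi in pairs)
        return sum(pi * xi ** t for pi, xi in pairs) ** (1 / t)

    def D(p, q):
        return M(p, [1 / pi if pi > 0 else 0 for pi in p], 1 - q)

    p = [0.4, 0.3, 0.2, 0.1]
    x = [1.0, 2.0, 5.0, 10.0]
    means = [M(p, x, t) for t in [-2, -1, 0, 1, 2]]
    assert all(a <= b + 1e-12 for a, b in zip(means, means[1:]))   # increasing in t

    for q in [0, 0.5, 2, 3]:
        explicit = sum(pi ** q for pi in p) ** (1 / (1 - q))
        assert abs(D(p, q) - explicit) < 1e-9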

Rényi entropies
Definition 5.5 Let p ∈ ∆n and q ∈ [0, ∞]. The Rényi entropy of p of
order q is
Hq (p) = log Dq (p).

Hill used Rényi’s work, which was earlier.


E.g. H1 = H.
The Rényi entropy H_q(p) and surprise entropy S_q(p) are both invertible functions of ∑_i p_i^q, hence of each other. So they convey the same information, presented differently. Rényi entropy has some advantages over surprise entropy:
• Dq = exp Hq is an effective number.
• Clean upper bound (namely, n) whereas upper bound for surprise entropy
is something messy.
• True multiplicativity (unlike Sq ):

Hq (p ⊗ r) = Hq (p) + Hq (r)

or equivalently Dq (p ⊗ r) = Dq (p)Dq (r); taking r = u2 gives replication


principle.
Q. Every linear combination of Hq s is multiplicative in this sense. Every
‘infinite linear combination’, i.e. integral, is too. Are these the only ones?

Back to diversity Important feature of Dq is modularity, as follows. Take
n islands of varying sizes, sharing no species. Then the diversity of their union
depends only on the diversities and relative sizes of the islands—not on their
internal structure.
That is: write w = (w1 , . . . , wn ) ∈ ∆n for the relative sizes of the islands. Write pi = (pi1 , . . . , piki ) for the relative abundances of the species on island i. Then the species distribution for the whole community is

    w ◦ (p1 , . . . , pn )

and modularity says that Dq (w ◦ (p1 , . . . , pn )) depends only on w, Dq (p1 ), . . . , Dq (pn ), not on p1 , . . . , pn themselves. Specifically:

    Dq (w ◦ (p1 , . . . , pn )) = Dq (w) · M1−q (w(q) , (Dq (p1 ), . . . , Dq (pn )))        (10)

where w(q) is the escort distribution

    w(q) = (w1^q , . . . , wn^q ) / (w1^q + · · · + wn^q ).

In terminology of a few weeks ago, this is a ‘chain rule’. Disadvantage of Rényi form over surprise form: chain rule is less obvious.
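
Here is a self-contained numerical check of the chain rule (10), with made-up island sizes and abundance vectors (my own sketch, not part of the notes):

    import numpy as np

    def hill(p, q):
        p = p[p > 0]
        return (p ** q).sum() ** (1 / (1 - q))

    def power_mean(w, x, t):
        m = w > 0
        return (w[m] * x[m] ** t).sum() ** (1 / t)

    q = 2.0
    w = np.array([0.6, 0.4])                        # relative island sizes
    p1 = np.array([0.5, 0.5])                       # species on island 1
    p2 = np.array([0.7, 0.2, 0.1])                  # species on island 2
    whole = np.concatenate([w[0] * p1, w[1] * p2])  # w ∘ (p1, p2)

    escort = w ** q / (w ** q).sum()                # escort distribution w^(q)
    lhs = hill(whole, q)
    rhs = hill(w, q) * power_mean(escort,
                                  np.array([hill(p1, q), hill(p2, q)]), 1 - q)
    assert np.isclose(lhs, rhs)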
I didn’t write up the following remarks and theorem in class, although I said
something about them out loud.
Remarks 5.6 i. The terminology ‘escort distribution’ is taken from ther-
modynamics, in which we see expressions such as

    (e^{−βE1} , . . . , e^{−βEn} ) / Z(β)

where Z(β) = e^{−βE1} + · · · + e^{−βEn} is the partition function for energies Ei at inverse temperature β.
ii. The function (q, w) ↦ w(q) is the scalar multiplication of a real vector space structure on the simplex ∆n (or actually its interior): see ‘How the simplex is a vector space’, n-Category Café, 11 June 2016.

It’s a theorem that the Hill numbers are the only diversity measures satisfying the chain rule (10), at least for q ≠ 0, 1, ∞. Maybe not very useful, as why would anyone ask for this chain rule? Anyway, here’s the result:

Theorem 5.7 Let q ∈ (0, 1) ∪ (1, ∞). Let (D : ∆n → [1, ∞))n≥1 be a sequence
of functions. TFAE:
i. D is continuous, symmetric, and satisfies the q-chain rule (10);
ii. D = Dq or D ≡ 1.

Proof Omitted. As far as I know, this result doesn’t exist in the literature. I’d
be happy to explain the proof to anyone interested. 

Next week: we incorporate similarity between species into our model. This
not only improves biological accuracy; it also leads to interesting new mathe-
matics.

Week IX (4 Apr)
Recap: crude model of community as probability distribution; definition
of Mt (p, x), of Dq (p) as M1−q (p, 1/p) (diversity as average rarity, with 1/p
defined coordinatewise) and explicit form, and Hq (p) = log Dq (p).

Similarity-sensitive diversity
Ref: Leinster and Cobbold, Ecology, 2012.
Model: community of n species, with:
• relative abundances p = (p1 , . . . , pn ) ∈ ∆n ;
• for each i, j, a similarity coefficient Zij ∈ [0, 1], giving matrix Z = (Zij ).
Assume Zii = 1 for all i.
Examples 5.8 i. Z = I: then different species have nothing in common
(naive model).
ii. Zij = percentage genetic similarity between species i and j.
iii. Or measure similarity phylogenetically, taxonomically, morphologically,
...
iv. Given metric d on {1, . . . , n}, put Zij = e−d(i,j) .
v. (Non-biological?) Given a reflexive graph with vertices 1, . . . , n, put Zij = 1 if there is an edge between i and j, and Zij = 0 otherwise.
Reflexive means that there is an edge from each vertex to itself.
Viewing p as a column vector, we have
    (Zp)i = ∑j Zij pj

—‘ordinariness’ of species i within our community. It’s the expected similarity between an individual of species i and an individual chosen from the community at random. So 1/(Zp)i is ‘specialness’.
Previously, defined diversity Dq (p) as average rarity, where rarity is 1/pi .
That was ignoring inter-species similarities—in effect, using naive model Z = I.
Now replace 1/pi by specialness 1/(Zp)i , which is a kind of rarity that takes
into account the proximity of other species.
Definition 5.9 Let q ∈ [0, ∞]. The diversity of order q of the community is DqZ (p) = M1−q (p, 1/Zp), given explicitly by

    ( ∑i : pi >0 pi (Zp)i^{q−1} )^{1/(1−q)}    if q ≠ 1, ∞,
    1 / ∏i : pi >0 (Zp)i^{pi}                  if q = 1,
    1 / max i : pi >0 (Zp)i                    if q = ∞.

Examples 5.10 i. Naive model Z = I: have DqI (p) = Dq (p). So the
classical quantities Dq (p)—the exponentials of Shannon entropy etc.—
have a big problem: they implicitly assume that different species have
nothing in common. This is of course nonsense.
ii. D2Z (p) = 1/ ∑i,j pi Zij pj . This is the reciprocal of expected similarity
between a randomly-chosen pair of individuals. You can measure this even
in situations (e.g. microbial) where you don’t have a species classification
at all. Something similar can be done for all integers q ≥ 2.
iii. Generally: incorporating similarity can change judgement on when one
community is more diverse than another. See our paper for examples.
Broad idea: it may be that one community initially looks more diverse,
but all the diversity is within a group of species that are highly similar,
in which case it’s really not so diverse after all.
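
To make the definition concrete, here is a minimal Python sketch of DqZ (my own illustration; the similarity matrix is made up), checking example (i) above (Z = I recovers the Hill numbers) and the q = 2 formula of example (ii):

    import numpy as np

    def div_Z(p, Z, q):
        """Similarity-sensitive diversity D_q^Z(p) = M_{1-q}(p, 1/Zp)."""
        Zp = Z @ p
        m = p > 0
        if q == 1:
            return 1 / np.prod(Zp[m] ** p[m])
        if q == np.inf:
            return 1 / Zp[m].max()
        return (p[m] * Zp[m] ** (q - 1)).sum() ** (1 / (1 - q))

    p = np.array([0.5, 0.3, 0.2])
    Z = np.array([[1.0, 0.9, 0.1],
                  [0.9, 1.0, 0.2],
                  [0.1, 0.2, 1.0]])   # hypothetical similarity matrix

    # Z = I recovers the Hill numbers, e.g. D_2(p) = 1 / sum p_i^2:
    assert np.isclose(div_Z(p, np.eye(3), 2), 1 / (p ** 2).sum())
    # D_2^Z(p) is the reciprocal of expected similarity of two random individuals:
    assert np.isclose(div_Z(p, Z, 2), 1 / (p @ Z @ p))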

Properties of DqZ , for each fixed q (and here I’ll skip some obvious ones):
• Splitting a species into two identical species doesn’t change the diversity.
Mathematically: add a new column to the matrix identical to the last one,
and a new row identical to the last one. Also split pn as a sum of two
probabilities however you like.
Consequence: by continuity, splitting a species into two highly similar
species causes only a slight increase in diversity. This is sensible, logical
behaviour. Species classifications can be somewhat arbitrary and change-
able.
• Modularity: take m islands, such that species on different islands are distinct and totally dissimilar. Then the diversity of the whole depends only on the diversities di and relative sizes wi of the islands. (‘Relative’ means ∑ wi = 1.) The important thing here is the functional relationship just stated. But I’ll also give the actual formula. Explicitly, the diversity of the whole is

    ( ∑i : wi >0 wi^q di^{1−q} )^{1/(1−q)}

(q ≠ 1, ∞) and similar formulas for q = 1, ∞. This is a kind of chain rule.


• Replication: if the m islands have the same diversity d and the same
size, the overall diversity is md.
• Monotonicity: increasing similarities decreases diversity. In particular,
DqZ (p) ≤ DqI (p). Interpretation: if a measure knows nothing of the com-
monalities between species, it will evaluate the community as more diverse
than it really is. The naive model overestimates diversity.
• Range: diversity is between 1 and n. Minimum occurs when only one
species is present. Come back to maximization in a minute.
Remark 5.11 Do these properties (plus some obvious ones) uniquely charac-
terize the diversity measures DqZ ? No! The problem is that Z isn’t really pinned
down. . . Come back to this next time, when we’ll pass to a more general setting
that makes life simpler.

Digression: entropy viewpoint Define

HqZ (p) = log DqZ (p)

—‘similarity-sensitive Rényi entropy’. Gives (for each q) a notion of the entropy of a distribution on a metric space (or graph).
Examples 5.12 i. HqI is usual Rényi entropy Hq . Taking Z = I corresponds to the metric with d(i, j) = ∞ for all i ≠ j.
ii. When q = 1, get a metric or similarity-sensitive version of Shannon entropy:

    H1Z (p) = − ∑i : pi >0 pi log(Zp)i .

Maximizing diversity Suppose we fix a list of species and can control their
relative abundances. What relative abundance distribution would we choose
in order to maximize the diversity, and what is the value of that maximum
diversity?
Or more mathematically: fix a finite metric space. Which probability dis-
tribution(s) maximize the diversity (or equivalently entropy), and what is the
value of the maximum diversity?
Fix q and a similarity matrix Z. Questions:
A Which distribution(s) p maximize DqZ (p)?

B What is the maximum diversity, sup_{p∈∆n} DqZ (p)?

Examples 5.13 i. For Z = I, maximum diversity is n (for any q), achieved when p = un (uniform distribution).
ii. Take two very similar species and one quite different species:

    [Diagram: species 1 and 2 close together, species 3 further away.]

Maximizing distribution should be close to (0.25, 0.25, 0.5); see the numerical sketch below.
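
A quick numerical illustration of example (ii), with a made-up similarity matrix in which species 1 and 2 are highly similar and species 3 is quite different; a brute-force grid search over the simplex maximizes D2Z (my own sketch, not part of the notes):

    import numpy as np

    Z = np.array([[1.0, 0.9, 0.1],
                  [0.9, 1.0, 0.1],
                  [0.1, 0.1, 1.0]])   # species 1, 2 similar; species 3 different

    def div2(p):
        return 1 / (p @ Z @ p)        # D_2^Z(p) = 1 / expected similarity

    grid = np.arange(0, 1.0001, 0.01)
    cands = [np.array([a, b, 1 - a - b])
             for a in grid for b in grid if a + b <= 1 + 1e-9]
    best = max(cands, key=div2)
    # best comes out near (0.26, 0.26, 0.48), close to the predicted (0.25, 0.25, 0.5).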

Recall that even in the case Z = I, different values of q give different judge-
ments on which communities are more diverse than which others. They rank
distributions p differently. So: In principle, the answers to both A and B
depend on q. But:
Theorem 5.14 (with Mark Meckes) Let Z be a symmetric similarity ma-
trix. Then:
A There is a distribution p that maximizes DqZ (p) for all q ∈ [0, ∞] simul-
taneously.
B sup_{p∈∆n} DqZ (p) is independent of q ∈ [0, ∞].

Proof Omitted; Leinster and Meckes, arXiv:1512.06314. 

So, there is a best of all possible worlds.
Remarks 5.15 i. For Z = I, trivial: example above.

ii. Can define the maximum diversity of a similarity matrix Z to be sup_{p} DqZ (p) (which is independent of q). So we’ve got a real invariant of similarity matrices. It’s something new!
iii. In particular, can define the maximum diversity of a finite metric space
to be the maximum diversity of (e−d(i,j) )i,j . Closely related to another
real invariant, magnitude, which is geometrically meaningful.
iv. For graphs (with Z as above), the maximum diversity is equal to the inde-
pendence number (dual of clique number). A set of vertices in a graph is
independent if no two of them are linked by an edge; the independence
number of a graph is the maximum cardinality of an independent set of
vertices.
v. There is an algorithm for calculating maximum diversity and all maximizing distributions. It takes 2^n steps. Assuming P ≠ NP, there is no polynomial-time algorithm: this is known for independence numbers of graphs, so it follows for maximum diversity generally. (A brute-force sketch for the graph case follows these remarks.)

vi. Any distribution p that maximizes DqZ (p) for one q > 0 actually max-
imizes DqZ (p) for all q ∈ [0, ∞]. (Not obvious, but in the paper with
Meckes.)
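
As promised in remark (v), here is a brute-force sketch for the graph case of remark (iv): for a reflexive graph with 0/1 similarity matrix Z as in Examples 5.8(v), the maximum diversity equals the independence number, which the 2^n-step search below computes. This is my own illustration, not the Leinster–Meckes algorithm for general Z.

    import itertools
    import numpy as np

    def independence_number(Z):
        """Largest set of vertices with no edges between distinct members.
        Z is the 0/1 similarity (adjacency) matrix of a reflexive graph."""
        n = len(Z)
        for k in range(n, 0, -1):                      # try the largest sets first
            for S in itertools.combinations(range(n), k):
                if all(Z[i][j] == 0 for i in S for j in S if i != j):
                    return k
        return 0

    # 4-cycle: independence number 2, so maximum diversity is 2.
    Z = np.array([[1, 1, 0, 1],
                  [1, 1, 1, 0],
                  [0, 1, 1, 1],
                  [1, 0, 1, 1]])
    assert independence_number(Z) == 2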

Next time: I’ll describe a more general family of measures and prove that in
a certain sense, it is the only sensible way of assigning a value to a collection of
things. I don’t presuppose that this ‘value’ is any kind of diversity, nor that there
is anything ecological about the setting. But it turns out that it’s closely related
to the diversity measures that we just met, and that it’s especially interesting
to interpret these ‘value’ measures in an ecological setting.

