I. Entropy in Statistical Mechanics: 03. Boltzmann Entropy, Gibbs Entropy, Shannon Information
Goal: To explain the behavior of macroscopic systems in terms of the dynamical laws governing their microscopic constituents. In particular: To provide a micro-dynamical explanation of the 2nd Law.
1. Boltzmann's Approach.
Consider different "macrostates" of a gas: for instance, the gas confined to one corner of its container, partly spread out, and spread uniformly throughout (the equilibrium macrostate).
Why does the gas prefer to be in the equilibrium macrostate?
Suppose the gas consists of N identical particles governed by Hamilton's equations of motion (the micro-dynamics). A microstate X of the gas is a specification of the position (3 values) and momentum (3 values) for each of its N particles.
Γ = phase space = 6N-dim space of all possible microstates. ΓE = the region of Γ that consists of all microstates with constant energy E.
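A concrete sketch of this bookkeeping (illustrative only; the array layout, the random sampling, and the unit-mass kinetic energy are assumptions, not anything in the notes):

```python
import numpy as np

# Illustrative sketch: represent a microstate X of an N-particle gas as an
# N x 6 array -- 3 position and 3 momentum values per particle.
N = 1000
rng = np.random.default_rng(0)

positions = rng.uniform(0.0, 1.0, size=(N, 3))  # particles in a unit box
momenta = rng.normal(0.0, 1.0, size=(N, 3))     # some assignment of momenta

X = np.hstack([positions, momenta])             # one point in 6N-dim phase space
print(X.shape)                                  # (1000, 6) -> 6N = 6000 coordinates

# Kinetic energy of the microstate (mass m = 1 assumed); microstates with the
# same total energy E all lie in the constant-energy region Gamma_E.
E = 0.5 * np.sum(momenta**2)
print(E)
```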
Hamiltonian dynamics maps an initial microstate Xi to a final microstate Xf. Can the 2nd Law be explained by recourse to this dynamics?
A macrostate of the gas is a specification of the gas in terms of macroscopic properties (pressure, temperature, volume, etc.). Relation between microstates and macrostates: Macrostates supervene on microstates!
To each microstate there corresponds exactly one macrostate. Many distinct microstates can correspond to the same macrostate.
So: ΓE is partitioned into a finite number of regions corresponding to macrostates, with each microstate X belonging to exactly one macrostate Γ(X).
Claim #1: Different macrostates have vastly different sizes. Quantify this by the Boltzmann Entropy: SB(Γ(X)) = k log|Γ(X)|, where |Γ(X)| = the volume (size) of Γ(X).
What this means: The greater SB, the larger the region of phase space in which X is located. Claim #2: The equilibrium macrostate Γeq is vastly larger than any other macrostate (in other words, SB obtains its maximum value for Γeq).
[Diagram: ΓE is almost entirely taken up by Γeq; the nonequilibrium macrostates are very small regions.]
Thus: SB increases over time because, for any initial microstate Xi, the dynamics will map Xi into Γeq very quickly, and then keep it there for an extremely long time.
Two Ways to Explain the Approach to Equilibrium:
(a) Appeal to Typicality. For large N, ΓE is almost entirely filled up with equilibrium microstates. Thus: A system approaches equilibrium because equilibrium microstates are typical and nonequilibrium microstates are atypical.
But: What is it about the dynamics that evolves atypical states to typical states? "If a system is in an atypical microstate, it does not evolve into an equilibrium microstate just because the latter is typical." (Frigg 2009) Need to identify properties of the dynamics that guarantee atypical states evolve into typical states. And: Need to show that these properties are typical. Example (Frigg 2009): If the dynamics is chaotic (in an appropriate sense), then (under certain conditions), any initial microstate Xi will quickly be mapped into Γeq and remain there for long periods of time.
(b) Appeal to Probabilities. Associate probabilities with macrostates: the larger the macrostate, the greater the probability of finding a microstate in it. Thus: A system approaches equilibrium because it evolves from states of lower toward states of higher probability, and the equilibrium state is the state of highest probability.
"In most cases, the initial state will be a very unlikely state. From this state the system will steadily evolve towards more likely states until it has finally reached the most likely state, i.e., the state of thermal equilibrium."
[Diagram: Arrangement #1: the states of particles P6, P89, ... placed among cells w1, w2, w3, ...]
Start with the phase space of a single particle. Partition it into cells w1, w2, ..., wk of size w. A state of an N-particle system is given by N points in this single-particle phase space. An arrangement is a specification of which points lie in which cells.
Arrangement #2: state of P89 in w1, state of P6 in w3, etc. Distribution: (1, 0, 2, 0, 1, 1, ...)
A distribution is a specification of how many points (regardless of which ones) lie in each cell. It takes the form (n1, n2, ..., nk), where ni = # of points in wi. Note: More than one arrangement can correspond to the same distribution.
How many arrangements G(Di) are compatible with a given distribution Di = (n1, n2, ..., nk)? Answer:
G(Di) = N!/(n1! n2! ··· nk!)
Check: Let D1 = (N, 0, ..., 0) and D2 = (N−1, 1, 0, ..., 0).
- G(D1) = N!/N! = 1. (Only one way for all N particles to be in w1.)
- G(D2) = N!/(N−1)! = N(N−1)(N−2)···1/(N−1)(N−2)···1 = N. (There are N different ways w2 could have one point in it; namely, if P1 was in it, or if P2 was in it, or if P3 was in it, etc.)
"The probability of this distribution [Di] is then given by the number of permutations of which the elements of this distribution are capable, that is by the number [G(Di)]. As the most probable distribution, i.e., as the one corresponding to thermal equilibrium, we again regard that distribution for which this expression is maximal..."
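A quick numerical check of this counting (a sketch; the helper G and the example numbers below are illustrative, not from the notes):

```python
from math import factorial
from functools import reduce

def G(D):
    """Number of arrangements compatible with a distribution D = (n1, ..., nk)."""
    N = sum(D)
    return factorial(N) // reduce(lambda acc, n: acc * factorial(n), D, 1)

N, k = 10, 5
D1 = (N,) + (0,) * (k - 1)        # all N points in cell w1
D2 = (N - 1, 1) + (0,) * (k - 2)  # one point moved into cell w2

print(G(D1))               # 1  (only one arrangement)
print(G(D2))               # 10 = N arrangements
print(G((2, 2, 2, 2, 2)))  # 113400 -- the uniform distribution admits far more arrangements
```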
The phase-space volume of the set of microstates compatible with Di is |ΓDi| = G(Di)·w^N. So: The Boltzmann entropy of Di is given by
SB(Di) = k log(G(Di)·w^N) = k log G(Di) + Nk log w = k log G(Di) + const.
= k log(N!/(n1! n2! ··· nk!)) + const.
= k log(N!) − k log(n1!) − ... − k log(nk!) + const.
≈ k(N log N − N) − k(n1 log n1 − n1) − ... − k(nk log nk − nk) + const.   [Stirling's approximation: log n! ≈ n log n − n]
= −k Σj nj log nj + const.   [using n1 + ... + nk = N and absorbing the Nk log N term into the constant]
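The Stirling step can be checked numerically; a minimal sketch (the sample values of n are arbitrary):

```python
import math

# Check Stirling's approximation log(n!) ~ n*log(n) - n, which justifies the
# step from k*log G(Di) to -k * sum_j nj*log(nj) + const.
for n in (10, 100, 1000, 10000):
    exact = math.lgamma(n + 1)            # log(n!) without overflow
    approx = n * math.log(n) - n
    print(n, round(exact, 2), round(approx, 2), round((exact - approx) / exact, 4))
```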
Let: pj = nj/N = the proportion of points in cell wj. Then: SB(Di) = −Nk Σj pj log pj + const.
Intuitively: The biggest value of SB is for the distribution Di in which the pj's are all equal (pj = 1/k); i.e., the equilibrium distribution.
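A small sketch comparing −Σj pj log pj for several distributions with the same N and k (the example numbers are assumed); the flatter the distribution, the larger SB:

```python
import math

def SB_per_Nk(D):
    """-sum_j pj*log(pj) for D = (n1, ..., nk) with pj = nj/N; up to the factor
    Nk and an additive constant, this is the Boltzmann entropy SB(Di)."""
    N = sum(D)
    return -sum((n / N) * math.log(n / N) for n in D if n > 0)

# Same N = 20 points, k = 4 cells:
for D in [(20, 0, 0, 0), (17, 1, 1, 1), (10, 5, 3, 2), (5, 5, 5, 5)]:
    print(D, round(SB_per_Nk(D), 4))
# 0.0, 0.5875, 1.208, 1.3863 -- the uniform distribution (pj all equal) maximizes SB.
```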
2. Gibbs' Approach.
Boltzmann's shtick: analysis of a single system. Each point of phase space Γ represents a possible state of the system.
Gibbs' shtick: analysis of an ensemble of infinitely many systems. Each point of phase space Γ represents a possible state of one member of the ensemble.
The state of the entire ensemble is given by a density function ρ(x, t) on Γ: ρ(x, t)dx gives the fraction of systems in the ensemble whose states lie in the region (x, x + dx). The probability at time t of finding the state of a system in a region R is then
pt(R) = ∫R ρ(x, t) dx
The Gibbs Entropy: SG(ρ) = −k ∫Γ ρ(x, t) log(ρ(x, t)) dx
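As a rough numerical illustration (not from the notes: the 1-D Gaussian densities, the grid, and setting k = 1 are assumptions), a more spread-out density has a larger Gibbs entropy:

```python
import numpy as np

def gibbs_entropy(rho, dx, k=1.0):
    """Approximate SG(rho) = -k * integral of rho*log(rho) dx on a 1-D grid."""
    rho = np.asarray(rho)
    mask = rho > 0
    return -k * np.sum(rho[mask] * np.log(rho[mask])) * dx

def gaussian(x, sigma):
    return np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

x = np.linspace(-10.0, 10.0, 2001)
dx = x[1] - x[0]

print(round(gibbs_entropy(gaussian(x, 0.5), dx), 4))  # ~0.7258 (narrow density)
print(round(gibbs_entropy(gaussian(x, 2.0), dx), 4))  # ~2.1121 (broader density, larger SG)
```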
Interpretive Issues:
Why do low-probability states evolve into high-probability states? Characterizations of the dynamics are, again, required to justify this.
How are the probabilities to be interpreted?
Ontic probabilities = properties of physical systems
- Long-run frequencies?
- Single-case propensities?
Epistemic probabilities = measures of degrees of belief
- Objective (rational) degrees of belief?
- Subjective degrees of belief?
3. Shannon Information.
The less likely a message is, the more info gained upon its reception! Let X = {x1, x2, ..., xk} = a set of messages. A probability distribution P on X is an assignment P = (p1, p2, ..., pk) of a probability pi = p(xi) to each message xi. Recall: This means pi ≥ 0 and p1 + p2 + ... + pk = 1.
A measure of information for X is a real-valued function H(X) : {probability distributions on X} → R that satisfies:
Continuity. H(p1, ..., pk) is continuous.
Additivity. H(p1q1, ..., pkqk) = H(P) + H(Q), for probability distributions P, Q.
Monotonicity. Info increases with k for uniform distributions: If k′ > k, then H(Q) > H(P), for P = (1/k, ..., 1/k) and Q = (1/k′, ..., 1/k′).
Branching. H(p1, ..., pk) is independent of how the process is divided into parts.
Bit normalization. The average info gained for two equally likely messages is one bit: H(1/2, 1/2) = 1.
Claim (Shannon 1949): There is exactly one function that satisfies these criteria; namely, the Shannon Entropy (or "Shannon Information"):
H(X) = −Σi pi log pi
H(X) is maximal for p1 = p2 = ... = pk = 1/k. H(X) = 0 just when one pi is 1 and the rest are 0. The logarithm is to base 2: log x = y ⇔ x = 2^y.
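A minimal sketch of the Shannon entropy and the properties just listed (the function name shannon_entropy and the example distributions are illustrative):

```python
import math

def shannon_entropy(P):
    """H(P) = -sum_i pi*log2(pi), with the convention 0*log2(0) = 0."""
    return -sum(p * math.log2(p) for p in P if p > 0)

print(shannon_entropy([0.5, 0.5]))            # 1.0 -- bit normalization
print(shannon_entropy([0.25] * 4))            # 2.0 -- maximal for the uniform distribution (log2 4)
print(shannon_entropy([1.0, 0.0, 0.0, 0.0]))  # 0.0 -- a certain message carries no information
print(shannon_entropy([0.7, 0.1, 0.1, 0.1]))  # ~1.357 -- less than the uniform value 2.0
```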
1. H(X) as Maximum Amount of Message Compression
Let X = {x1, ..., xn} be a set of letters from which we construct the messages. Suppose the messages have N letters apiece. The probability distribution P = (p1, ..., pn) is now over the letter set. Then: A typical sequence of letters will contain p1N occurrences of x1, p2N occurrences of x2, etc. Thus:
The number of distinct typical sequences of letters = N!/((p1N)! (p2N)! ··· (pnN)!)
So: log(the number of distinct typical sequences of letters) ≈ NH(X)   [by Stirling's approximation, as in the calculation of SB]
So: the number of distinct typical sequences of letters ≈ 2^NH(X)
So: There are only 2^NH(X) typical messages. This means we can encode them using only NH(X) bits.
Check: - 2 possible messages require 1 bit: 0, 1. - 4 possible messages require 2 bits: 00, 01, 10, 11. - etc.
So: If we need log n bits for each letter, we'll need N log n bits for a sequence of N letters. Thus: Instead of requiring N log n bits to encode our messages, we can get by with only NH(X) bits. Thus: H(X) represents the maximum amount that messages drawn from a given set of letters can be compressed.
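A back-of-the-envelope sketch of the compression claim (the alphabet, letter probabilities, and message length are assumed for illustration):

```python
import math

letters = ["a", "b", "c", "d"]      # n = 4 letters
P = [0.7, 0.1, 0.1, 0.1]            # letter probabilities
N = 10_000                          # letters per message

H = -sum(p * math.log2(p) for p in P if p > 0)
naive_bits = N * math.log2(len(letters))   # fixed-length encoding: N*log(n) bits
typical_bits = N * H                       # compression bound: N*H(X) bits

print(round(H, 4))            # ~1.3568 bits per letter
print(int(naive_bits))        # 20000 bits
print(round(typical_bits))    # ~13568 bits -- the best achievable on average
```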
2. H(X) as a Measure of Uncertainty
Suppose P = (p1, ..., pn) is a probability distribution over a set of values {x1, ..., xn} of a random variable X. The expected value E(X) of X is given by E(X) = Σi pi xi.
Let −log pi be the information gained if X is measured to have the value xi.
Recall: The greater pi is, the more certain xi is, and the less information should be associated with it.
Then H(X) = Σi pi(−log pi) is the expected value of this information gain. What this means: H(X) tells us our expected information gain upon measuring X.
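A two-line numerical check, with an assumed distribution, that H(X) is the expected value of the information gain −log pi:

```python
import math

P = [0.5, 0.25, 0.125, 0.125]                            # an assumed distribution

H_direct = -sum(p * math.log2(p) for p in P)             # -sum pi*log2(pi)
H_as_expectation = sum(p * (-math.log2(p)) for p in P)   # E[-log2 p] = expected info gain

print(H_direct, H_as_expectation)                        # 1.75 1.75 -- the same quantity
```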
3. Conditional Entropy
A communication channel is a device with a set of input states X = {x1, ..., xn} that are mapped to a set of output states Y = {y1, ..., yn}. Given probability distributions p(xi) and p(yj) on X and Y, and a joint distribution p(xi ∧ yj), the conditional probability p(xi|yj) of input xi, given output yj, is defined by
p(xi|yj) = p(xi ∧ yj)/p(yj)
Suppose we use our channel a very large number N of times. Then: There will be 2^NH(X) typical input strings. And: There will be 2^NH(Y) typical output strings. And: There will be 2^NH(X ∧ Y) typical strings of X and Y values. So:
The # of typical input strings that could result in a given typical output string = 2^NH(X ∧ Y)/2^NH(Y) = 2^N(H(X ∧ Y) − H(Y)) = 2^NH(X|Y)
If one is trying to use a noisy channel to send a message, then the conditional entropy H(X|Y) specifies the # of bits per letter that would need to be sent by an auxiliary noiseless channel in order to correct all the errors due to noise.
Check Claim: H(X ∧ Y) = H(Y) + H(X|Y), where H(X|Y) = −Σi,j p(xi ∧ yj) log p(xi|yj).
H(X ∧ Y) = −Σi,j p(xi ∧ yj) log p(xi ∧ yj)
= −Σi,j p(xi ∧ yj) log[p(yj) p(xi|yj)]
= −Σi,j p(xi ∧ yj) log p(yj) − Σi,j p(xi ∧ yj) log p(xi|yj)
= H(Y) + H(X|Y)   [using Σi p(xi ∧ yj) = p(yj)]
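A numerical check of the claim (the 2×2 joint distribution below is an assumed toy example, not from the notes):

```python
import math

def H(probs):
    """Shannon entropy (base 2) of a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Toy joint distribution p(xi & yj) for a 2-input, 2-output channel (rows: x, columns: y).
joint = [[0.4, 0.1],
         [0.1, 0.4]]

p_y = [sum(joint[i][j] for i in range(2)) for j in range(2)]   # marginals p(yj)

H_XY = H([joint[i][j] for i in range(2) for j in range(2)])    # H(X & Y)
# H(X|Y) = -sum_ij p(xi & yj) * log2 p(xi|yj), with p(xi|yj) = p(xi & yj)/p(yj)
H_X_given_Y = -sum(joint[i][j] * math.log2(joint[i][j] / p_y[j])
                   for i in range(2) for j in range(2))

print(round(H_XY, 6))                    # 1.721928
print(round(H(p_y) + H_X_given_Y, 6))    # 1.721928 -- H(X & Y) = H(Y) + H(X|Y)
```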
Interpretive Issues:
1. How should the probabilities p(xi) be interpreted?
Emphasis is on uncertainty: The information content of a message xi is a function of how uncertain it is, with respect to the receiver. So: Perhaps the probabilities are epistemic. In particular: p(xi) is a measure of the receiver's degree of belief in the accuracy of message xi.
But: The probabilities are set by the nature of the source. If the source is not probabilistic, then p(xi) can be interpreted epistemically. If the source is inherently probabilistic, then p(xi) can be interpreted as the ontic probability that the source produces message xi.
2. How is Shannon Information/Entropy related to other notions of entropy?
Thermodynamic entropy: ΔS = Sf − Si = ∫ dQR/T
Boltzmann entropy: SB(Γ(X)) = k log|Γ(X)|
Gibbs entropy: SG(ρ) = −k ∫Γ ρ(x, t) log(ρ(x, t)) dx
Shannon entropy: H(X) = −Σi pi log pi
Can statistical mechanics be given an information-theoretic foundation? Can the 2nd Law be given an information-theoretic foundation?