Shannon Entropy
Contents
1 Introduction
2 Formal definitions
2.1 Relationship to thermodynamic entropy
2.2 Entropy as information content
2.3 Data compression
2.4 Limitations of entropy as information content
2.5 Data as a Markov process
2.6 Alternative definition
3 Efficiency
4 Derivation of Shannon's entropy
5 Properties of Shannon's information entropy
6 Extending discrete entropy to the continuous case: differential entropy
7 References
8 See also
9 External links
Introduction
The concept of entropy in information theory describes how much information there is in a signal or
event. Shannon introduced the idea of information entropy in his 1948 paper "A Mathematical Theory of
Communication".
An intuitive understanding of information entropy relates to the amount of uncertainty about an event
associated with a given probability distribution. As an example, consider a box containing many coloured
balls. If the balls are all of different colours and no colour predominates, then our uncertainty about the
colour of a randomly drawn ball is maximal. On the other hand, if the box contains more red balls than
any other colour, then there is somewhat less uncertainty about the result: the ball drawn from the box is
more likely to be red (if we were forced to place a bet, we would bet on a red ball). Telling someone
the colour of each newly drawn ball provides them with more information in the first case than it does in
the second case, because there is more uncertainty about what might happen in the first case than there is
in the second. Intuitively, if we know the number of balls remaining, and they are all of one color, then
there is no uncertainty about what the next ball drawn will be, and therefore there is no information
content from drawing the ball. As a result, the entropy of the "signal" (the sequence of balls drawn, as
calculated from the probability distribution) is higher in the first case than in the second.
Shannon, in fact, defined entropy as a measure of the average information content associated with a
random outcome.
Shannon's definition of information entropy makes this intuitive distinction mathematically precise. His
definition satisfies these desiderata:
The measure should be continuous, i.e., changing the value of one of the probabilities by a very
small amount should only change the entropy by a small amount.
If all the outcomes (ball colours in the example above) are equally likely, then entropy should be
maximal. In this case, the entropy increases with the number of outcomes.
If the outcome is a certainty, then the entropy should be zero.
The amount of entropy should be the same independently of how the process is regarded as being
divided into parts.
(Note: The Shannon/Weaver book makes reference to Tolman (1938) who in turn credits Pauli (1933)
with the definition of entropy Shannon used. Elsewhere in statistical mechanics, the literature includes
references to von Neumann having derived the same form of entropy in 1927, which may explain why
von Neumann favoured the use of the existing term 'entropy'.)
Formal definitions
Shannon defines entropy in terms of a discrete random variable X, with possible states (or outcomes)
x_1, \ldots, x_n, as:

H(X) = \sum_{i=1}^{n} p(x_i) \log_2 \frac{1}{p(x_i)} = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i) ,

where

p(x_i) = \Pr(X = x_i)

is the probability of the ith outcome of X.
That is, the entropy of the variable X is the sum, over all possible outcomes x_i of X, of the product of the
probability of outcome x_i times the log of the inverse of the probability of x_i (which is also called the
surprisal of x_i); the entropy of X is thus the expected value of its outcomes' surprisal. We can also apply this to a
general probability distribution, rather than a discrete-valued event.
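As an illustrative sketch (the function name shannon_entropy and the example probabilities are choices made here, not part of Shannon's text), the sum above can be evaluated directly in Python; zero-probability terms are skipped, since p log(1/p) tends to 0 as p tends to 0:

import math

def shannon_entropy(probs, base=2):
    # H = -sum(p * log_base(p)); outcomes with p == 0 contribute nothing,
    # because p * log(1/p) -> 0 as p -> 0.
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))   # fair coin: 1.0 bit
print(shannon_entropy([0.9, 0.1]))   # biased coin: about 0.469 bits
print(shannon_entropy([1.0]))        # certain outcome: 0 bits (printed as -0.0)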
Shannon shows that any definition of entropy satisfying his assumptions will be of the form:

H = -K \sum_{i=1}^{n} p_i \log p_i ,

where K is a positive constant that amounts to a choice of the unit of measurement.
Redundancy in language structure, and statistical regularities such as the occurrence frequencies of letter
or word pairs, triplets, etc., are examples of higher-order structure that lowers the entropy per character of
natural-language text. See Markov chain.
Data compression
Entropy effectively bounds the performance of the strongest lossless (or nearly lossless) compression
possible, which can be realized in theory by using the typical set or in practice using Huffman, Lempel-Ziv,
or arithmetic coding. The performance of existing data compression algorithms is often used as a
rough estimate of the entropy of a block of data.
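As a rough, hedged illustration of that last point (the text, the compressor settings, and the function names here are arbitrary choices), one can compare the empirical order-0 entropy of some data with the output size per byte of a general-purpose compressor such as zlib:

import math
import zlib
from collections import Counter

def order0_entropy_bits_per_byte(data):
    # Empirical entropy in bits per byte, treating bytes as independent symbols.
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

data = b"abracadabra " * 1000
compressed = zlib.compress(data, 9)

print(order0_entropy_bits_per_byte(data))   # order-0 estimate, bits per byte
print(8 * len(compressed) / len(data))      # zlib output, bits per byte
# zlib exploits repetition across bytes, so on this highly repetitive input it
# goes far below the order-0 estimate; neither figure is the "true" entropy.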
Consider, for example, a source that produces the string ABABABAB..., in which A is always followed by B
and vice versa. Treating the characters as independent symbols gives an estimated entropy rate of 1 bit per
character, but if we consider the symbols as two-character blocks, then the entropy rate is 0 bits per character.
However, if we use very large blocks, then the estimate of per-character entropy rate may become
artificially low. This is because in reality, the probability distribution of the sequence is not knowable
exactly; it is only an estimate. For example, suppose one considers the text of every book ever published
as a sequence, with each symbol being the text of a complete book. If there are N published books, and
each book is only published once, the estimate of the probability of each book is 1/N, and the entropy (in
bits) is -log2(1/N) = log2 N. As a practical code, this corresponds to assigning each book a unique identifier and using
it in place of the text of the book whenever one wants to refer to the book. This is enormously useful for
talking about books, but it is not so useful for characterizing the information content of an individual
book, or of language in general: it is not possible to reconstruct the book from its identifier without
knowing the probability distribution, that is, the complete text of all the books. The key idea is that the
complexity of the probabilistic model must be considered. Kolmogorov complexity is a theoretical
generalization of this idea that allows the consideration of the information content of a sequence
independent of any particular probability model; it considers the shortest program for a universal
computer that outputs the sequence. A code that achieves the entropy rate of a sequence for a given
model, plus the codebook (i.e. the probabilistic model), is one such program, but it may not be the
shortest.
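The block-size effect described above can be made concrete with a small sketch (the periodic test string and the block sizes are arbitrary choices): estimating bits per character from empirical frequencies of length-k blocks captures more structure as k grows, but for blocks approaching the length of the data the estimate becomes artificially low for any sequence, because each long block is seen only once.

import math
from collections import Counter

def entropy_rate_estimate(seq, k):
    # Bits per character estimated from empirical frequencies of length-k blocks.
    blocks = [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]
    counts = Counter(blocks)
    n = len(blocks)
    h_block = -sum(c / n * math.log2(c / n) for c in counts.values())
    return h_block / k

seq = "AB" * 500   # ABABAB...
for k in (1, 2, 4, 100):
    print(k, entropy_rate_estimate(seq, k))
# k = 1 gives 1.0 bit/character; k >= 2 gives 0.0, since every block is "ABAB...".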
Data as a Markov process

A common way to define entropy for text is based on a Markov model of the text. For an order-0 source
(each character selected independently of the preceding characters), the binary entropy is

H(\mathcal{S}) = -\sum_i p_i \log_2 p_i ,

where p_i is the probability of i. For a first-order Markov source (one in which the probability of selecting
a character depends only on the immediately preceding character), the entropy rate is

H(\mathcal{S}) = -\sum_i p_i \sum_j p_i(j) \log_2 p_i(j) ,
where i is a state (certain preceding characters) and p_i(j) is the probability of j given i as the previous
character(s).
For a second-order Markov source, the entropy rate is

H(\mathcal{S}) = -\sum_i p_i \sum_j p_i(j) \sum_k p_{i,j}(k) \log_2 p_{i,j}(k) .
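A minimal sketch of the first-order formula (the two-state transition matrix below is a made-up example, and the stationary probabilities p_i are approximated by fixed-point iteration rather than solved exactly):

import math

def stationary(P, iters=1000):
    # Approximate the stationary distribution of a row-stochastic matrix P.
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

def first_order_entropy_rate(P):
    # H = -sum_i p_i sum_j p_i(j) log2 p_i(j)
    pi = stationary(P)
    n = len(P)
    return -sum(pi[i] * P[i][j] * math.log2(P[i][j])
                for i in range(n) for j in range(n) if P[i][j] > 0)

# Example: after 'A' the next character is 'A' with probability 0.9;
# after 'B' it is 'A' or 'B' with probability 0.5 each.
P = [[0.9, 0.1],
     [0.5, 0.5]]
print(first_order_entropy_rate(P))   # about 0.56 bits per character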
In general the b-ary entropy of a source \mathcal{S} = (S, P) with source alphabet S = \{a_1, \ldots, a_n\} and discrete
probability distribution P = \{p_1, \ldots, p_n\}, where p_i is the probability of a_i (say p_i = p(a_i)), is defined by:

H_b(\mathcal{S}) = -\sum_{i=1}^{n} p_i \log_b p_i .
Note: the b in "b-ary entropy" is the number of different symbols of the "ideal alphabet" which is being
used as the standard yardstick to measure source alphabets. In information theory, two symbols are
necessary and sufficient for an alphabet to be able to encode information, therefore the default is to let b =
2 ("binary entropy"). Thus, the entropy of the source alphabet, with its given empirical probability
distribution, is a number equal to the number (possibly fractional) of symbols of the "ideal alphabet", with
an optimal probability distribution, necessary to encode for each symbol of the source alphabet. Also note
that "optimal probability distribution" here means a uniform distribution: a source alphabet with n
symbols has the highest possible entropy (for an alphabet with n symbols) when the probability
distribution of the alphabet is uniform. This optimal entropy turns out to be

H_b(\mathcal{S}) = \log_b n .
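As a short sketch (function and variable names are arbitrary choices), the b-ary entropy can be computed directly, and the uniform distribution can be checked to attain the maximum value log_b n:

import math

def b_ary_entropy(probs, b=2):
    # H_b = -sum_i p_i log_b p_i
    return -sum(p * math.log(p, b) for p in probs if p > 0)

n, b = 4, 3
skewed = [0.7, 0.1, 0.1, 0.1]
uniform = [1.0 / n] * n

print(b_ary_entropy(skewed, b))                    # about 0.86, below the maximum
print(b_ary_entropy(uniform, b), math.log(n, b))   # both about 1.26 = log_3(4)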
Alternative definition
Another way to define the entropy function H (not using the Markov model) is by proving that H is
uniquely defined (as earlier mentioned) if and only if H satisfies the following conditions:
1. H(p_1, \ldots, p_n) is defined and continuous for all p_1, \ldots, p_n, where p_i \in [0,1] for all i = 1, \ldots, n and
p_1 + \cdots + p_n = 1. (Remark that the function solely depends on the probability distribution, not the alphabet.)

2. For all positive integers n, H satisfies

H\left(\tfrac{1}{n}, \ldots, \tfrac{1}{n}\right) < H\left(\tfrac{1}{n+1}, \ldots, \tfrac{1}{n+1}\right) .

3. For positive integers b_i with b_1 + \cdots + b_k = n, H satisfies

H\left(\tfrac{1}{n}, \ldots, \tfrac{1}{n}\right) = H\left(\tfrac{b_1}{n}, \ldots, \tfrac{b_k}{n}\right) + \sum_{i=1}^{k} \tfrac{b_i}{n}\, H\left(\tfrac{1}{b_i}, \ldots, \tfrac{1}{b_i}\right) .
This last functional relationship characterizes the entropy of a system with sub-systems and is in a sense
the most important of the three. It demands that the entropy of a system can be calculated from the
entropy of its sub-systems if we know how the sub-systems interact with each other.
Assume that we have an ensemble of n elements with a uniform distribution on them. If we mentally
divide this ensemble into k boxes (sub-systems) with bi elements in each, the entropy can be calculated as
a sum of the individual entropies of the boxes, weighted by the probability of finding oneself in that particular
box, plus the entropy of the system of boxes.
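This decomposition can be checked numerically; in the sketch below the box sizes b_i are an arbitrary example, and H is the binary Shannon entropy:

import math

def H(probs):
    # Shannon entropy in bits.
    return -sum(p * math.log2(p) for p in probs if p > 0)

boxes = [2, 3, 5]        # b_i: number of elements in each box
n = sum(boxes)           # 10 elements in total, uniformly distributed

total = H([1.0 / n] * n)                                   # H(1/n, ..., 1/n)
decomposed = H([b / n for b in boxes]) + sum(
    (b / n) * H([1.0 / b] * b) for b in boxes)             # boxes + within-box terms
print(total, decomposed)   # both equal log2(10), about 3.3219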
Efficiency
A source alphabet encountered in practice will generally have a probability distribution that is less
than optimal. If the source alphabet has n symbols, then it can be compared to an "optimized alphabet"
with n symbols, whose probability distribution is uniform. The ratio of the entropy of the source alphabet
to the entropy of its optimized version is the efficiency of the source alphabet, which can be expressed
as a percentage.
This implies that the efficiency of a source alphabet with n symbols can be defined simply as being equal
to its n-ary entropy. See also Redundancy (information theory).
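A brief sketch of the efficiency calculation (the four-symbol distribution below is an example chosen here): the ratio of the source's binary entropy to log2 n, the binary entropy of the uniform alphabet, coincides with the n-ary entropy of the source:

import math

def entropy(probs, base=2):
    return -sum(p * math.log(p, base) for p in probs if p > 0)

probs = [0.5, 0.25, 0.125, 0.125]    # example source alphabet with n = 4 symbols
n = len(probs)

efficiency = entropy(probs, 2) / math.log2(n)   # H_2(source) / H_2(uniform)
print(efficiency)                               # 0.875, i.e. 87.5 %
print(entropy(probs, n))                        # n-ary entropy: also 0.875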
Derivation of Shannon's entropy

Q. Given a roulette with n pockets which are all equally likely to be landed on by the ball, what is the
probability of obtaining a distribution (A_1, A_2, \ldots, A_n), where A_i is the number of times pocket i was
landed on and P = A_1 + \cdots + A_n is the total number of ball-landing events?

A. The probability is given by the multinomial distribution

\Pr(A_1, \ldots, A_n) = \frac{\Omega}{N} ,

where

\Omega = \frac{P!}{A_1!\,A_2!\cdots A_n!}

is the number of possible combinations of outcomes (for the events) which fit the given distribution, and

N = n^P

is the number of all possible combinations of outcomes for the set of P events.

Q. And what is the entropy?

A. The entropy of the distribution is obtained from the logarithm of this probability, taken per event:

\frac{1}{P}\log_2 \Pr = \frac{1}{P}\log_2 \Omega - \log_2 n = \frac{1}{P}\left(\log_2 P! - \sum_x \log_2 A_x!\right) - \log_2 n .

The factorials can be approximated well by Stirling's approximation, \log_2 x! \approx x\log_2 x - x\log_2 e, which
together with \sum_x A_x = P gives

\frac{1}{P}\log_2 \Omega \approx \log_2 P - \sum_x \frac{A_x}{P}\log_2 A_x = -\sum_x \frac{A_x}{P}\log_2 \frac{A_x}{P} .

So the entropy is

-\sum_x p_x \log_2 p_x + \log_2 \frac{1}{n} , \qquad \text{where } p_x = \frac{A_x}{P} ,

and the term \log_2(1/n) can be dropped since it is a constant, independent of the p_x distribution. The result is

H = -\sum_x p_x \log_2 p_x .

Thus, the Shannon entropy is a consequence of the equation

H = \frac{1}{P}\log_2 \Omega ,

which expresses the entropy per event as the logarithm of the number of ways \Omega in which the observed
distribution of outcomes can occur.
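The Stirling step can be sanity-checked numerically: for a large number of events P, the exact value of (1/P) log2 Ω is already close to -sum_x p_x log2 p_x. The pocket counts below are an arbitrary example:

import math

def per_event_log2_omega(counts):
    # (1/P) * log2( P! / (A_1! ... A_n!) ), using log-gamma: log(x!) = lgamma(x + 1).
    P = sum(counts)
    log2_omega = (math.lgamma(P + 1)
                  - sum(math.lgamma(a + 1) for a in counts)) / math.log(2)
    return log2_omega / P

def plug_in_entropy(counts):
    # -sum_x (A_x / P) log2 (A_x / P)
    P = sum(counts)
    return -sum(a / P * math.log2(a / P) for a in counts if a > 0)

counts = [500, 300, 150, 50]         # A_x for n = 4 pockets, P = 1000 events
print(per_event_log2_omega(counts))  # about 1.63
print(plug_in_entropy(counts))       # about 1.65; the gap shrinks as P grows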
Properties of Shannon's information entropy

Shannon's entropy satisfies the following properties.

For any n, H_n(p_1, \ldots, p_n) is a continuous and symmetric function of the variables p_1, p_2, \ldots, p_n.
An event of probability zero does not contribute to the entropy, i.e. for any n,

H_{n+1}(p_1, \ldots, p_n, 0) = H_n(p_1, \ldots, p_n) .
If the probabilities are all equal, p_i = 1/n, then H_n is maximal; in general,

H_n(p_1, \ldots, p_n) \le H_n\left(\tfrac{1}{n}, \ldots, \tfrac{1}{n}\right) = \log_2 n .
If we partition the mn outcomes of the random experiment into m groups, with each group containing n
elements, we can do the experiment in two steps: first, determine the group to which the actual outcome
belongs; then, find the outcome within that group. The probability that you will observe group i is
q_i = p_{i1} + \cdots + p_{in}, and the conditional probability distribution function for group i is
(p_{i1}/q_i, \ldots, p_{in}/q_i). Entropy then satisfies

H_{mn}(p_{11}, \ldots, p_{mn}) = H_m(q_1, \ldots, q_m) + \sum_{i=1}^{m} q_i\, H_n\left(\frac{p_{i1}}{q_i}, \ldots, \frac{p_{in}}{q_i}\right) ,

where the entropy H_n(p_{i1}/q_i, \ldots, p_{in}/q_i) is the entropy of the probability distribution conditioned on
group i. This property means that the total information is the sum of the information gained in the first
step, H_m(q_1, \ldots, q_m), and a weighted sum of the entropies conditioned on each group.
Khinchin in 1957 showed that the only function satisfying the above assumptions is of the form

H_n(p_1, \ldots, p_n) = -k \sum_{i=1}^{n} p_i \log p_i ,

where k is a positive constant representing the desired unit of measurement.
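The grouping property above can be verified numerically for a small example; the 2 x 3 table of joint probabilities p_ij below is arbitrary:

import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# p[i][j]: probability of outcome j inside group i (m = 2 groups, n = 3 outcomes each)
p = [[0.10, 0.20, 0.10],
     [0.25, 0.05, 0.30]]
q = [sum(row) for row in p]                    # group probabilities q_i

lhs = H([pij for row in p for pij in row])     # H_mn(p_11, ..., p_mn)
rhs = H(q) + sum(q[i] * H([pij / q[i] for pij in p[i]]) for i in range(len(p)))
print(lhs, rhs)                                # equal, up to floating-point rounding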
Extending discrete entropy to the continuous case: differential entropy

The formula

h[f] = -\int_{-\infty}^{\infty} f(x) \log_2 f(x)\, dx , \qquad (*)

where f denotes a probability density function on the real line, is analogous to the Shannon entropy and
could thus be viewed as an extension of the Shannon entropy to the domain of real numbers. Formula (*)
is usually referred to as the continuous entropy, or differential entropy. Although the analogy between
both functions is suggestive, the following question must be asked: is the Boltzmann entropy a valid
extension of the Shannon entropy? To answer this question, we must establish a connection between the
two functions:
We wish to obtain a generally finite measure as the bin size goes to zero. In the discrete case, the bin size
is the (implicit) width of each of the n (finite or infinite) bins whose probabilities are denoted by pn. As
we generalize to the continuous domain, we must make this width explicit.
To do this, start with a continuous function f discretized into bins of size \Delta. By the mean-value theorem
there exists a value x_i in each bin such that

f(x_i)\,\Delta = \int_{i\Delta}^{(i+1)\Delta} f(x)\,dx ,

and thus the integral of the function f can be approximated (in the Riemannian sense) by

\int_{-\infty}^{\infty} f(x)\,dx = \lim_{\Delta \to 0} \sum_{i=-\infty}^{\infty} f(x_i)\,\Delta ,

where this limit and "bin size goes to zero" are equivalent.
We will denote by H^{\Delta} the entropy of the discretized distribution, whose bin probabilities are f(x_i)\Delta:

H^{\Delta} := -\sum_{i=-\infty}^{\infty} f(x_i)\,\Delta \log_2\big(f(x_i)\,\Delta\big) = -\sum_{i=-\infty}^{\infty} f(x_i)\,\Delta \log_2 f(x_i) - \log_2 \Delta ,

where the last step uses \sum_i f(x_i)\,\Delta = \int_{-\infty}^{\infty} f(x)\,dx = 1. As \Delta \to 0, we have

\sum_{i=-\infty}^{\infty} f(x_i)\,\Delta \log_2 f(x_i) \to \int_{-\infty}^{\infty} f(x)\log_2 f(x)\,dx ,

and so

H^{\Delta} + \log_2 \Delta \to -\int_{-\infty}^{\infty} f(x)\log_2 f(x)\,dx = h[f] \qquad \text{as } \Delta \to 0 ,

which is, as said before, referred to as the differential entropy. This means that the differential entropy is
not a limit of the Shannon entropy for n \to \infty (equivalently, for bin size \Delta \to 0); rather, H^{\Delta} diverges,
differing from the differential entropy by the term -\log_2 \Delta \to \infty.
It turns out as a result that, unlike the Shannon entropy, the differential entropy is not in general a good
measure of uncertainty or information. For example, the differential entropy can be negative; also it is not
invariant under continuous co-ordinate transformations.
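The offset log2 Δ can be observed numerically. The sketch below (using a standard normal density as an example) bins the density, computes the discrete entropy of the binned distribution, and compares H^Δ + log2 Δ against the known closed-form differential entropy of a Gaussian, (1/2) log2(2 π e σ²):

import math

def gaussian_pdf(x, sigma=1.0):
    return math.exp(-x * x / (2 * sigma * sigma)) / (sigma * math.sqrt(2 * math.pi))

def binned_entropy(pdf, delta, lo=-10.0, hi=10.0):
    # Discrete entropy H^Delta of the density cut into bins of width delta,
    # with each bin's probability approximated by pdf(midpoint) * delta.
    h = 0.0
    x = lo
    while x < hi:
        p = pdf(x + delta / 2) * delta
        if p > 0:
            h -= p * math.log2(p)
        x += delta
    return h

closed_form = 0.5 * math.log2(2 * math.pi * math.e)   # differential entropy of N(0, 1)
for delta in (0.5, 0.1, 0.01):
    print(delta, binned_entropy(gaussian_pdf, delta) + math.log2(delta), closed_form)
# H^Delta itself grows like -log2(delta); the combination H^Delta + log2(delta)
# stays near the closed-form value of about 2.047 bits.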
More useful for the continuous case is the relative entropy of a distribution, defined as the Kullback-Leibler
divergence from the distribution to a reference measure m(x),

D_{\mathrm{KL}}(f \,\|\, m) = \int f(x) \log_2 \frac{f(x)}{m(x)}\, dx .
The relative entropy carries over directly from discrete to continuous distributions, and is invariant under
co-ordinate reparametrisations.
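As a small discrete-case sketch (the two distributions are example values), the relative entropy D_KL(p || m) = sum_i p_i log2(p_i / m_i) is non-negative and vanishes exactly when p = m:

import math

def kl_divergence(p, m):
    # D_KL(p || m) in bits; requires m_i > 0 wherever p_i > 0.
    return sum(pi * math.log2(pi / mi) for pi, mi in zip(p, m) if pi > 0)

p = [0.5, 0.3, 0.2]
m = [1/3, 1/3, 1/3]          # reference measure: uniform
print(kl_divergence(p, m))   # about 0.10 bits
print(kl_divergence(p, p))   # 0.0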
References
This article incorporates material from Shannon's entropy on PlanetMath, which is licensed under the
GFDL.
See also
Binary entropy function - the entropy of a Bernoulli trial with probability of success p
Conditional entropy
Cross entropy - a measure of the average number of bits needed to identify an event from a set of
possibilities when the coding scheme is based on one probability distribution while the events follow another
Joint entropy - the measure of how much entropy is contained in a joint system of two random
variables.
Entropy encoding - a coding scheme that assigns codes to symbols so as to match code lengths
with the probabilities of the symbols.
Kolmogorov-Sinai entropy in dynamical systems
Rényi entropy - a generalisation of information entropy; it is one of a family of functionals for
quantifying the diversity, uncertainty or randomness of a system.
Perplexity
Quantum relative entropy - a measure of distinguishability between two quantum states.
Theil index
External links