Probability and Information Theory
I. WHY PROBABILITY?
In many cases, it is more practical to use a simple but uncertain rule rather than a complex but certain
one, even if the true rule is deterministic and our modeling system has the fidelity to accommodate a com-
plex rule. For example, the simple rule “Most birds fly” is cheap to develop and is broadly useful, while a
rule of the form, “Birds fly, except for very young birds that have not yet learned to fly, sick or injured birds that have lost the ability to fly, flightless species of birds including the cassowary, ostrich and kiwi. . .” is expensive to develop, maintain and communicate and, after all this effort, is still brittle and prone to failure.
Probability theory was originally developed to analyze the frequencies of events. It is easy to see how
probability theory can be used to study events like drawing a certain hand of cards in a poker game.
These kinds of events are often repeatable. When we say that an outcome has a probability p of occurring,
it means that if we repeated the experiment (e.g., drawing a hand of cards) infinitely many times, then
a proportion p of the repetitions would result in that outcome. Probability can be seen as the extension
of logic to deal with uncertainty. Logic provides a set of formal rules for determining what propositions
are implied to be true or false given the assumption that some other set of propositions is true or false.
Probability theory provides a set of formal rules for determining the likelihood of a proposition being
true given the likelihood of other propositions.
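To make the frequency interpretation above concrete, here is a minimal Python sketch: repeat an experiment many times and check that the observed proportion of an outcome approaches its probability p. The particular event and its probability below are illustrative assumptions, not taken from the text.

import random

random.seed(0)
p = 5 / 52            # probability of the illustrative event (e.g., drawing one of 5 chosen cards)
trials = 200_000      # number of repetitions of the experiment

hits = sum(random.random() < p for _ in range(trials))   # count how often the event occurs
print(f"theoretical p = {p:.4f}, observed proportion = {hits / trials:.4f}")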
Information theory is a branch of applied mathematics that revolves around quantifying how much infor-
mation is present in a signal. It was originally invented to study sending messages from discrete alphabets
over a noisy channel, such as communication via radio transmission. In this context, information theory
tells us how to design optimal codes and calculate the expected length of messages sampled from specific
probability distributions using various encoding schemes. In the context of machine learning, we can
also apply information theory to continuous variables where some of these message length interpreta-
tions do not apply. This field is fundamental to many areas of electrical engineering and computer science.
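As a rough illustration of the “expected message length” remark above, the following sketch computes the expected codeword length of a toy prefix-free code and compares it with the entropy lower bound (entropy is defined formally in the Entropy subsection below). The symbols, probabilities, and codewords are assumed for illustration only.

import math

# Toy 4-symbol source and a prefix-free code for it (both assumed for illustration).
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
code = {"a": "0", "b": "10", "c": "110", "d": "111"}

expected_length = sum(p * len(code[s]) for s, p in probs.items())   # bits per symbol under this code
entropy = -sum(p * math.log2(p) for p in probs.values())            # lower bound in bits per symbol

print(f"expected code length = {expected_length:.3f} bits/symbol")
print(f"entropy lower bound  = {entropy:.3f} bits/symbol")          # equal here, so the code is optimal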
Information theory plays an important role in various areas of machine learning, in particular:
• Feature selection uses the information-theoretic notions of information gain and mutual information.
• Learning algorithms, such as decision tree induction, use the information-theoretic concept of
entropy to decide how to partition the document space.
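As a small sketch of the information-gain idea behind these uses, the following Python snippet measures how much the entropy of the class labels drops after splitting on a candidate feature. The labels and the split are made up for illustration, and entropy itself is defined formally in the Entropy subsection below.

import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of the empirical class distribution of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

parent = ["spam"] * 6 + ["ham"] * 6      # class labels before the split
left = ["spam"] * 5 + ["ham"] * 1        # documents where the candidate feature is present
right = ["spam"] * 1 + ["ham"] * 5       # documents where it is absent

weighted_children = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
information_gain = entropy(parent) - weighted_children
print(f"information gain of the split = {information_gain:.3f} bits")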
Information theory originated from Shannon’s research on the capacity of noisy information channels.
Information theory is concerned with maximizing the information one can transmit over an imperfect
communication channel. The central concept of information theory is that of entropy. Entropy measures
“the amount of uncertainty” in a probability distribution.
A. Self-Information
Let E be an event with probability Pr(E), and let I(E) represent the amount of information you gain
when you learn that E has occurred (or equivalently, the amount of uncertainty you lose after learning
that E has happened). Then a natural question to ask is “what properties should I(E) have?” Here are
some common properties that I(E), which is called the self-information, is reasonably expected to have.
• I(E) should be a function of Pr(E).
• I(Pr(E)) should be continuous in Pr(E), where I(·) is a function defined over the interval [0, 1].
• If E1 and E2 are independent events, then I(E1 ∩ E2) = I(E1) + I(E2).
The only function that satisfies properties 1, 2 and 3 (up to a positive constant factor) is the logarithmic function, which gives the definition I(E) = −log Pr(E); taking the logarithm to base 2 measures the self-information in bits.
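A minimal Python sketch of this definition, checking the additivity property for two independent events (the probabilities below are illustrative):

import math

def self_information(p, base=2):
    """Self-information of an event with probability p, in bits when base = 2."""
    return -math.log(p, base)

p1, p2 = 0.5, 0.25                       # probabilities of two independent events (illustrative)
joint = self_information(p1 * p2)        # I(E1 ∩ E2), using Pr(E1 ∩ E2) = Pr(E1) Pr(E2)
separate = self_information(p1) + self_information(p2)
print(f"I(E1 ∩ E2) = {joint:.3f} bits, I(E1) + I(E2) = {separate:.3f} bits")   # both 3.000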
B. Entropy
Entropy is a measure of the amount of information (or uncertainty) contained in the source. For a discrete source X it is defined as

H(X) = − Σ_{x ∈ X} P_X(x) log P_X(x),

where the source alphabet X is finite, and P_X(x) is the probability distribution of the source X for x ∈ X.
By the above definition, entropy can be interpreted as the expected or average amount of self-information
you gain when you learn that one of the |X| outcomes has occurred, where |X| is the cardinality of X.
Another interpretation is that H(X) is a measure of the uncertainty of the random variable X. As an example,
let X be a random variable with P_X(1) = p and P_X(0) = 1 − p. Then

H(X) = −p log p − (1 − p) log(1 − p),

which is the binary entropy function; it equals 0 when p = 0 or p = 1 (no uncertainty) and is maximized at p = 1/2.
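A minimal Python sketch of this binary entropy function (the p values below are illustrative) shows that the uncertainty peaks at p = 0.5 and vanishes at p = 0 and p = 1:

import math

def binary_entropy(p):
    """H(X) for a binary source with Pr(X = 1) = p, in bits."""
    if p in (0.0, 1.0):
        return 0.0                       # a certain outcome carries no uncertainty
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(f"p = {p:.1f}  ->  H(X) = {binary_entropy(p):.3f} bits")
# H(X) peaks at p = 0.5, where the two outcomes are equally likely and uncertainty is largest.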