Probability and Information Theory


Johar M. Ashfaque
Probability theory is a fundamental tool of many disciplines of science and engineering. It is a mathemat-
ical framework for representing uncertain statements. It provides the means of quantifying uncertainty
as well as axioms for deriving new uncertain statements. In artificial intelligence applications, we use
probability theory in two major ways. First, the laws of probability tell us how AI systems should reason,
so we design our algorithms to compute or approximate various expressions derived using probability
theory. Second, we can use probability and statistics to theoretically analyze the behavior of proposed
AI systems.

I. WHY PROBABILITY?

There are three possible sources of uncertainty:


• Inherent stochasticity in the system being modeled. For example, most interpretations of quantum
mechanics describe the dynamics of subatomic particles as being probabilistic. We can also create
theoretical scenarios that we postulate to have random dynamics, such as a hypothetical card game
where we assume that the cards are truly shuffled into a random order.
• Incomplete observability. Even deterministic systems can appear stochastic when we cannot observe
all the variables that drive the behavior of the system. For example, in the Monty Hall problem, a
game show contestant is asked to choose between three doors and wins a prize held behind the chosen
door. Two doors lead to a goat while a third leads to a car. The outcome given the contestant’s
choice is deterministic, but from the contestant’s point of view, the outcome is uncertain.
• Incomplete modeling. When we use a model that must discard some of the information we have
observed, the discarded information results in uncertainty in the model’s predictions. For example,
suppose we build a robot that can exactly observe the location of every object around it. If the
robot discretizes space when predicting the future location of these objects, then the discretization
makes the robot immediately become uncertain about the precise position of objects: each object
could be anywhere within the discrete cell that it was observed to occupy.

In many cases, it is more practical to use a simple but uncertain rule rather than a complex but certain
one, even if the true rule is deterministic and our modeling system has the fidelity to accommodate a com-
plex rule. For example, the simple rule “Most birds fly” is cheap to develop and is broadly useful, while a
rule of the form, “Birds fly, except for very young birds that have not yet learned to fly, sick or injured birds
that have lost the ability to fly, flightless species of birds including the cassowary, ostrich and kiwi. . .” is ex-
pensive to develop, maintain and communicate and, after all this effort, is still brittle and prone to failure.

Probability theory was originally developed to analyze the frequencies of events. It is easy to see how
probability theory can be used to study events like drawing a certain hand of cards in a poker game.
These kinds of events are often repeatable. When we say that an outcome has a probability p of occurring,
it means that if we repeated the experiment (e.g., drawing a hand of cards) infinitely many times, then
a proportion p of the repetitions would result in that outcome. Probability can be seen as the extension
of logic to deal with uncertainty. Logic provides a set of formal rules for determining what propositions
are implied to be true or false given the assumption that some other set of propositions is true or false.
Probability theory provides a set of formal rules for determining the likelihood of a proposition being
true given the likelihood of other propositions.
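The frequentist reading above, that a probability p is the long-run proportion of repetitions producing an outcome, can be checked numerically. The following is a minimal sketch (the helper names and the ace-drawing example are mine, not from the text):

```python
import random

def estimate_probability(trial, n_trials=100_000):
    """Estimate Pr(outcome) as the fraction of repeated trials in which it occurs."""
    successes = sum(trial() for _ in range(n_trials))
    return successes / n_trials

# Hypothetical repeatable experiment: drawing the top card of a shuffled 52-card deck.
def draw_is_ace():
    deck = list(range(52))   # cards 0-3 stand in for the four aces
    random.shuffle(deck)
    return deck[0] < 4

estimate = estimate_probability(draw_is_ace)
# As n_trials grows, the estimate concentrates around 4/52.
```

With many repetitions the empirical proportion approaches the true probability 4/52, which is exactly the sense of "probability p" described above.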

II. INFORMATION THEORY

Information theory is a branch of applied mathematics that revolves around quantifying how much infor-
mation is present in a signal. It was originally invented to study sending messages from discrete alphabets
over a noisy channel, such as communication via radio transmission. In this context, information theory
tells us how to design optimal codes and calculate the expected length of messages sampled from specific
probability distributions using various encoding schemes. In the context of machine learning, we can
also apply information theory to continuous variables where some of these message length interpreta-
tions do not apply. This field is fundamental to many areas of electrical engineering and computer science.

Information theory plays an important role in various areas of machine learning, in particular:
• Feature selection uses the information-theoretic notions of information gain and mutual information.
• Learning algorithms, such as decision tree induction, use the information-theoretic concept of
entropy to decide how to partition the document space.
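As a sketch of the second point, a decision tree can score a candidate split by the information gain it yields: the entropy of the parent node minus the size-weighted entropy of the children. The labels and function names below are hypothetical, chosen only for illustration:

```python
import math
from collections import Counter

def entropy(labels):
    """H = -sum p log2 p over the empirical label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    n = len(parent)
    return (entropy(parent)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))

# A perfectly separating split recovers all of the parent's entropy:
parent = ["spam", "spam", "ham", "ham"]
gain = information_gain(parent, ["spam", "spam"], ["ham", "ham"])
```

A split that leaves each child pure drives the child entropies to zero, so the gain equals the parent's entropy (here 1 bit); a useless split has gain near zero.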

Information theory originated from Shannon’s research on the capacity of noisy information channels.
Information theory is concerned with maximizing the information one can transmit over an imperfect
communication channel. The central concept of information theory is that of entropy. Entropy measures
“the amount of uncertainty” in a probability distribution.

A. Self-Information

Let E be an event with probability Pr(E), and let I(E) represent the amount of information you gain
when you learn that E has occurred (or, equivalently, the amount of uncertainty you lose after learning
that E has happened). A natural question to ask is: what properties should I(E) have? Here are
some common properties that I(E), which is called the self-information, is reasonably expected to have.
• I(E) should be a function of Pr(E).
• I should be a continuous function of Pr(E), with I(·) defined over the interval [0, 1].
• If E1 and E2 are independent events, then I(E1 ∩ E2) = I(E1) + I(E2).
The only function that satisfies properties 1, 2 and 3 is the logarithmic function, I(E) = − log Pr(E).
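The additivity property can be checked directly: because the probability of independent events multiplies and the logarithm turns products into sums, self-information adds. A minimal sketch (function name is mine):

```python
import math

def self_information(p, base=2):
    """I(E) = -log Pr(E); rarer events carry more information (bits when base=2)."""
    return -math.log(p, base)

# Independence: Pr(E1 ∩ E2) = Pr(E1) * Pr(E2), so I(E1 ∩ E2) = I(E1) + I(E2).
p1, p2 = 0.5, 0.25
joint = self_information(p1 * p2)                       # I of the joint event
separate = self_information(p1) + self_information(p2)  # sum of the parts
```

Both quantities come out to 3 bits, illustrating property 3 for the logarithmic choice of I.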

B. Entropy

Entropy is a measure of the amount of information (or uncertainty) contained in the source.

Definition II.1 For a source X, the entropy H(X) is defined by


H(X) = − ∑_{x ∈ X} P_X(x) log P_X(x)

where the source alphabet X is finite, and P_X(x) is the probability distribution of the source X for x ∈ X.
By the above definition, entropy can be interpreted as the expected or average amount of self-information
you gain when you learn that one of the |X| outcomes has occurred, where |X| is the cardinality of X.
Another interpretation is that H(X) is a measure of the uncertainty of the random variable X. As an example,
let X be a random variable with P_X(1) = p and P_X(0) = 1 − p. Then

H(X) = −p · log p − (1 − p) · log(1 − p).

This is called the binary entropy function.
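The definition of H(X) and the binary entropy function translate directly into code. A minimal sketch, with hypothetical function names, that also shows the two extremes of the binary case:

```python
import math

def entropy(dist):
    """H(X) = -sum_x P(x) log2 P(x), skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def binary_entropy(p):
    """Binary entropy: -p log2 p - (1 - p) log2 (1 - p)."""
    return entropy([p, 1 - p])

# A fair coin is maximally uncertain (1 bit at p = 0.5),
# while a deterministic source carries no uncertainty (0 bits at p = 0 or p = 1).
```

For a uniform source over four outcomes, the same formula gives H = 2 bits, matching the interpretation of entropy as the average self-information of an outcome.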
