.. _sec_information_theory:
Information Theory
==================
The universe is overflowing with information. Information provides a
common language across disciplinary rifts: from Shakespeare’s sonnets to
researchers’ papers on Cornell arXiv, from Van Gogh’s painting Starry
Night to Beethoven’s Symphony No. 5, from the first programming
language Plankalkül to state-of-the-art machine learning algorithms.
Everything must follow the rules of information theory, no matter the
format. With information theory, we can measure and compare how much
information is present in different signals. In this section, we will
investigate the fundamental concepts of information theory and
applications of information theory in machine learning.
Before we get started, let’s outline the relationship between machine
learning and information theory. Machine learning aims to extract
interesting signals from data and make critical predictions. On the
other hand, information theory studies encoding, decoding, transmitting,
and manipulating information. As a result, information theory provides
a fundamental language for discussing information processing in
machine learning systems. For example, many machine learning applications
use the cross entropy loss as described in :numref:`sec_softmax`. This
loss can be directly derived from information theoretic considerations.
Information
-----------
Let’s start with the “soul” of information theory: information.
*Information* can be encoded in anything with a particular sequence of
one or more encoding formats. Suppose that we task ourselves with trying
to define a notion of information. What could be our starting point?
Consider the following thought experiment. We have a friend with a deck
of cards. They will shuffle the deck, flip over some cards, and tell us
statements about the cards. We will try to assess the information
content of each statement.
First, they flip over a card and tell us, “I see a card.” This provides
us with no information at all. We were already certain that this was the
case so we hope the information should be zero.
Next, they flip over a card and say, “I see a heart.” This provides us
some information, but in reality there are only :math:`4` different
suits that were possible, each equally likely, so we are not surprised
by this outcome. We hope that whatever the measure of information, this
event should have low information content.
Next, they flip over a card and say, “This is the :math:`3` of spades.”
This is more information. Indeed there were :math:`52` equally likely
possible outcomes, and our friend told us which one it was. This should
be a medium amount of information.
Let’s take this to the logical extreme. Suppose that finally they flip
over every card from the deck and read off the entire sequence of the
shuffled deck. There are :math:`52!` different orders to the deck, again
all equally likely, so we need a lot of information to know which one it
is.
Any notion of information we develop must conform to this intuition.
Indeed, in the next sections we will learn how to compute that these
events have :math:`0\text{ bits}`, :math:`2\text{ bits}`,
:math:`\approx 5.7\text{ bits}`, and :math:`\approx 225.6\text{ bits}` of information
respectively.
If we read through these thought experiments, we see a natural idea. As
a starting point, rather than caring about the knowledge, we may build
off the idea that information represents the degree of surprise or the
abstract possibility of the event. For example, if we want to describe
an unusual event, we need a lot of information. For a common event, we may
not need much information.
In 1948, Claude E. Shannon published *A Mathematical Theory of
Communication* :cite:`Shannon.1948` establishing the theory of
information. In this paper, Shannon introduced the concept of information
entropy for the first time. We will begin our journey here.
Self-information
~~~~~~~~~~~~~~~~
Since information embodies the abstract possibility of an event, how do
we map the possibility to the number of bits? Shannon introduced the
terminology *bit* as the unit of information, which was originally
created by John Tukey. So what is a “bit” and why do we use it to
measure information? Historically, an antique transmitter could only send
or receive two types of code: :math:`0` and :math:`1`. Indeed, binary
encoding is still in common use on all modern digital computers. In this
way, any information is encoded by a series of :math:`0` and :math:`1`.
And hence, a series of binary digits of length :math:`n` contains
:math:`n` bits of information.
Now, suppose that for any series of codes, each :math:`0` or :math:`1`
occurs with a probability of :math:`\frac{1}{2}`. Hence, an event
:math:`X` with a series of codes of length :math:`n`, occurs with a
probability of :math:`\frac{1}{2^n}`. At the same time, as we mentioned
before, this series contains :math:`n` bits of information. So, can we
generalize this to a mathematical function that maps the probability
:math:`p` to a number of bits? Shannon gave the answer by defining
*self-information*
.. math:: I(X) = - \log_2 (p),
as the *bits* of information we have received for this event :math:`X`.
Note that we will always use base-2 logarithms in this section. For the
sake of simplicity, the rest of this section will omit the subscript 2
in the logarithm notation, i.e., :math:`\log(\cdot)` always refers to
:math:`\log_2(\cdot)`. For example, the code “0010” has a self-information
.. math:: I(\text{``0010"}) = - \log (p(\text{``0010"})) = - \log \left( \frac{1}{2^4} \right) = 4 \text{ bits}.
We can calculate self-information in MXNet as shown below. Before that,
let’s first import all the necessary packages in this section.
.. code:: python
from mxnet import np
from mxnet.metric import NegativeLogLikelihood
from mxnet.ndarray import nansum
import random
def self_information(p):
    # Information content, in bits, of an event with probability p
    return -np.log2(p)
self_information(1/64)
.. parsed-literal::
:class: output
6.0
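Returning to the card experiment, the same formula reproduces the numbers quoted earlier. The sketch below is a minimal check that uses Python's standard ``math`` module instead of MXNet (a substitution made here only to keep the snippet self-contained); the helper ``information_bits`` and the dictionary ``card_events`` are illustrative names, not part of any library.

```python
import math

def information_bits(p):
    """Self-information log2(1 / p), in bits, of an event with probability p."""
    return math.log2(1 / p)

# Probability of each statement in the card experiment.
card_events = {
    "I see a card": 1.0,
    "I see a heart": 1 / 4,
    "This is the 3 of spades": 1 / 52,
    "the order of the whole deck": 1 / math.factorial(52),
}

bits = {event: information_bits(p) for event, p in card_events.items()}
for event, b in bits.items():
    print(f"{event}: {b:.1f} bits")
```

Rarer statements carry more bits: the certain event carries none, while naming one of the :math:`52!` orderings carries the most.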
Entropy
-------
As self-information only measures the information of a single discrete
event, we need a more generalized measure for any random variable of
either discrete or continuous distribution.
Motivating Entropy
~~~~~~~~~~~~~~~~~~
Let’s try to get specific about what we want. This will be an informal
statement of what are known as the *axioms of Shannon entropy*. It will
turn out that the following collection of common-sense statements force
us to a unique definition of information. A formal version of these
axioms, along with several others may be found in
:cite:`Csiszar.2008`.
1. The information we gain by observing a random variable does not
depend on what we call the elements, or the presence of additional
elements which have probability zero.
2. The information we gain by observing two random variables is no more
than the sum of the information we gain by observing them separately.
If they are independent, then it is exactly the sum.
3. The information gained when observing (nearly) certain events is
(nearly) zero.
While proving this fact is beyond the scope of our text, it is important
to know that this uniquely determines the form that entropy must take.
The only ambiguity that these allow is in the choice of fundamental
units, which is most often normalized by making the choice we saw before
that the information provided by a single fair coin flip is one bit.
Definition
~~~~~~~~~~
For any random variable :math:`X` that follows a probability
distribution :math:`P` with a probability density function (p.d.f.) or a
probability mass function (p.m.f.) :math:`p(x)`, we measure the expected
amount of information through *entropy* (or *Shannon entropy*)
.. math:: H(X) = - E_{x \sim P} [\log p(x)].
:label: eq_ent_def
To be specific, if :math:`X` is discrete,
.. math:: H(X) = - \sum_i p_i \log p_i \text{, where } p_i = P(X_i).
Otherwise, if :math:`X` is continuous, we also refer to entropy as
*differential entropy*
.. math:: H(X) = - \int_x p(x) \log p(x) \; dx.
In MXNet, we can define entropy as below.
.. code:: python
def entropy(p):
    entropy = - p * np.log2(p)
    # nansum skips NaN terms, which arise from 0 * log2(0)
    out = nansum(entropy.as_nd_ndarray())
    return out
entropy(np.array([0.1, 0.5, 0.1, 0.3]))
.. parsed-literal::
:class: output
[1.6854753]
Interpretations
~~~~~~~~~~~~~~~
You may be curious: in the entropy definition :eq:`eq_ent_def`, why
do we use an expectation of a negative logarithm? Here are some
intuitions.
First, why do we use a *logarithm* function :math:`\log`? Suppose that
:math:`p(x) = f_1(x) f_2(x) \ldots f_n(x)`, where the component
functions :math:`f_i(x)` are independent of each other. This means that
each :math:`f_i(x)` contributes independently to the total information
obtained from :math:`p(x)`. As discussed above, we want the entropy
formula to be additive over independent random variables. Luckily,
:math:`\log` can naturally turn a product of probability distributions
to a summation of the individual terms.
Next, why do we use a *negative* :math:`\log`? Intuitively, more
frequent events should contain less information than less common events,
since we often gain more information from an unusual case than from an
ordinary one. However, :math:`\log` is monotonically increasing with the
probabilities, and indeed non-positive for all values in :math:`(0, 1]`. We
need to construct a monotonically decreasing relationship between the
probability of events and their entropy, which will ideally be always
non-negative (for nothing we observe should force us to forget what we have
known). Hence, we add a negative sign in front of the :math:`\log` function.
Last, where does the *expectation* function come from? Consider a random
variable :math:`X`. We can interpret the self-information
(:math:`-\log(p)`) as the amount of *surprise* we have at seeing a
particular outcome. Indeed, as the probability approaches zero, the
surprise becomes infinite. Similarly, we can interpret the entropy as
the average amount of surprise from observing :math:`X`. For example,
imagine that a slot machine system emits statistically independent
symbols :math:`\{s_1, \ldots, s_k\}` with probabilities
:math:`\{p_1, \ldots, p_k\}` respectively. Then the entropy of this system
equals the average self-information from observing each output, i.e.,
.. math:: H(S) = \sum_i {p_i \cdot I(s_i)} = - \sum_i {p_i \cdot \log p_i}.
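The "average surprise" reading can be checked by simulation. The following is a minimal sketch, assuming a hypothetical machine with four symbols and the probabilities listed in ``probs``; it uses only the standard library, and the sample size and seed are arbitrary choices.

```python
import math
import random

random.seed(0)  # fixed seed so the simulation is repeatable

probs = [0.1, 0.5, 0.1, 0.3]  # p_1, ..., p_k for symbols s_1, ..., s_k

# Exact entropy: the expected self-information, in bits.
exact_entropy = -sum(p * math.log2(p) for p in probs)

# Simulate the machine and average the surprise -log2(p_i) of each emitted symbol.
num_draws = 100_000
symbols = random.choices(range(len(probs)), weights=probs, k=num_draws)
average_surprise = sum(-math.log2(probs[s]) for s in symbols) / num_draws

print(f"entropy: {exact_entropy:.4f} bits")
print(f"average surprise: {average_surprise:.4f} bits")
```

With enough draws the empirical average should settle close to the exact entropy, which is what the expectation in the definition expresses.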
Properties of Entropy
~~~~~~~~~~~~~~~~~~~~~
By the above examples and interpretations, we can derive the following
properties of entropy :eq:`eq_ent_def`. Here, we refer to X as an
event and P as the probability distribution of X.
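- Entropy is non-negative for discrete random variables, i.e., :math:`H(X) \geq 0` for all discrete :math:`X` (the differential entropy of a continuous :math:`X` can be negative).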
- If :math:`X \sim P` with a p.d.f. or a p.m.f. :math:`p(x)`, and we
try to estimate :math:`P` by a new probability distribution :math:`Q`
with a p.d.f. or a p.m.f. :math:`q(x)`, then
.. math:: H(X) = - E_{x \sim P} [\log p(x)] \leq - E_{x \sim P} [\log q(x)], \text{ with equality if and only if } P = Q.
\ Alternatively, :math:`H(X)` gives a lower bound of the average
number of bits needed to encode symbols drawn from :math:`P`.
- If :math:`X \sim P`, then :math:`x` conveys the maximum amount of
information if it spreads evenly among all possible outcomes.
Specifically, if the probability distribution :math:`P` is discrete
with :math:`k`-class :math:`\{p_1, \ldots, p_k \}`, then
.. math:: H(X) \leq \log(k), \text{ with equality if and only if } p_i = \frac{1}{k}, \forall i.
\ If :math:`P` is a continuous distribution, then the story
becomes much more complicated. However, if we additionally impose
that :math:`P` is supported on a finite interval (with all values
between :math:`0` and :math:`1`), then :math:`P` has the highest
entropy if it is the uniform distribution on that interval.
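For discrete distributions, these properties are easy to check numerically. The sketch below is illustrative only: the distributions ``p`` and ``q`` are arbitrary choices, and the helper names are not from any library.

```python
import math

def entropy_bits(p):
    """Shannon entropy of a discrete distribution, in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def expected_surprise_bits(p, q):
    """-E_{x ~ P}[log2 q(x)]: average surprise when P is modeled by Q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.1, 0.5, 0.1, 0.3]
q = [0.4, 0.2, 0.2, 0.2]          # an arbitrary, mismatched estimate of p
uniform = [1 / len(p)] * len(p)

h_p = entropy_bits(p)
bound_with_q = expected_surprise_bits(p, q)
bound_with_p = expected_surprise_bits(p, p)
h_uniform = entropy_bits(uniform)

print(h_p >= 0)                          # non-negativity
print(h_p < bound_with_q)                # strict inequality because Q != P
print(abs(h_p - bound_with_p) < 1e-12)   # equality when Q = P
print(h_p <= h_uniform, h_uniform, math.log2(len(p)))  # uniform attains log(k)
```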
Mutual Information
------------------
Previously we defined the entropy of a single random variable :math:`X`. How
about the entropy of a pair of random variables :math:`(X, Y)`? We can
think of these techniques as trying to answer the following type of
question, “What information is contained in :math:`X` and :math:`Y`
together compared to each separately? Is there redundant information, or
is it all unique?”
For the following discussion, we always use :math:`(X, Y)` as a pair of
random variables that follows a joint probability distribution :math:`P`
with a p.d.f. or a p.m.f. :math:`p_{X, Y}(x, y)`, while :math:`X` and
:math:`Y` follow probability distribution :math:`p_X(x)` and
:math:`p_Y(y)`, respectively.
Joint Entropy
~~~~~~~~~~~~~
Similar to entropy of a single random variable :eq:`eq_ent_def`, we
define the *joint entropy* :math:`H(X, Y)` of a pair of random variables
:math:`(X, Y)` as
.. math:: H(X, Y) = - E_{(x, y) \sim P} [\log p_{X, Y}(x, y)].
:label: eq_joint_ent_def
Precisely, on the one hand, if :math:`(X, Y)` is a pair of discrete
random variables, then
.. math:: H(X, Y) = - \sum_{x} \sum_{y} p_{X, Y}(x, y) \log p_{X, Y}(x, y).
On the other hand, if :math:`(X, Y)` is a pair of continuous random
variables, then we define the *differential joint entropy* as
.. math:: H(X, Y) = - \int_{x, y} p_{X, Y}(x, y) \ \log p_{X, Y}(x, y) \;dx \;dy.
We can think of :eq:`eq_joint_ent_def` as telling us the total
randomness in the pair of random variables. As a pair of extremes, if
:math:`X = Y` are two identical random variables, then the information
in the pair is exactly the information in one and we have
:math:`H(X, Y) = H(X) = H(Y)`. On the other extreme, if :math:`X` and
:math:`Y` are independent then :math:`H(X, Y) = H(X) + H(Y)`. Indeed we
will always have that the information contained in a pair of random
variables is no smaller than the entropy of either random variable and
no more than the sum of both.
.. math::
H(X), H(Y) \le H(X, Y) \le H(X) + H(Y).
Let’s implement joint entropy from scratch in MXNet.
.. code:: python
def joint_entropy(p_xy):
    joint_ent = -p_xy * np.log2(p_xy)
    # nansum skips NaN terms, which arise from 0 * log2(0)
    out = nansum(joint_ent.as_nd_ndarray())
    return out
joint_entropy(np.array([[0.1, 0.5], [0.1, 0.3]]))
.. parsed-literal::
:class: output
[1.6854753]
Notice that this is the same *code* as before, but now we interpret it
differently as working on the joint distribution of the two random
variables.
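The bounds :math:`H(X), H(Y) \le H(X, Y) \le H(X) + H(Y)` and the two extreme cases can also be checked directly. The sketch below is a standard-library illustration, assuming joint distributions stored as dictionaries ``{(x, y): probability}``; this representation and the helper names are choices made for this example only.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def entropies(joint):
    """Return (H(X), H(Y), H(X, Y)) for a joint p.m.f. {(x, y): probability}."""
    p_x, p_y = {}, {}
    for (x, y), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p
    return (entropy_bits(p_x.values()), entropy_bits(p_y.values()),
            entropy_bits(joint.values()))

general = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.1, (1, 1): 0.3}
identical = {(0, 0): 0.2, (1, 1): 0.8}                 # X = Y
independent = {(x, y): px * py
               for x, px in enumerate([0.2, 0.8])
               for y, py in enumerate([0.6, 0.4])}

results = {name: entropies(joint) for name, joint in
           [("general", general), ("identical", identical),
            ("independent", independent)]}
for name, (h_x, h_y, h_xy) in results.items():
    print(f"{name}: H(X)={h_x:.4f}, H(Y)={h_y:.4f}, H(X,Y)={h_xy:.4f}")
```

For the identical pair the joint entropy collapses to :math:`H(X)`, for the independent pair it equals :math:`H(X) + H(Y)`, and the general case falls in between.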
Conditional Entropy
~~~~~~~~~~~~~~~~~~~
The joint entropy defined above measures the amount of information contained in a
pair of random variables. This is useful, but often it is not what
we care about. Consider the setting of machine learning. Let’s take
:math:`X` to be the random variable (or vector of random variables) that
describes the pixel values of an image, and :math:`Y` to be the random
variable which is the class label. :math:`X` should contain substantial
information—a natural image is a complex thing. However, the information
contained in :math:`Y` once the image has been shown should be low.
Indeed, the image of a digit should already contain the information
about what digit it is unless the digit is illegible. Thus, to continue
to extend our vocabulary of information theory, we need to be able to
reason about the information content in a random variable conditional on
another.
In probability theory, we saw the definition of the *conditional
probability* to measure the relationship between variables. We now want
to analogously define the *conditional entropy* :math:`H(Y \mid X)`. We
can write this as
.. math:: H(Y \mid X) = - E_{(x, y) \sim P} [\log p(y \mid x)],
:label: eq_cond_ent_def
where :math:`p(y \mid x) = \frac{p_{X, Y}(x, y)}{p_X(x)}` is the
conditional probability. Specifically, if :math:`(X, Y)` is a pair of
discrete random variables, then
.. math:: H(Y \mid X) = - \sum_{x} \sum_{y} p(x, y) \log p(y \mid x).
If :math:`(X, Y)` is a pair of continuous random variables, then the
*differential conditional entropy* is similarly defined as
.. math:: H(Y \mid X) = - \int_x \int_y p(x, y) \ \log p(y \mid x) \;dx \;dy.
It is now natural to ask, how does the *conditional entropy*
:math:`H(Y \mid X)` relate to the entropy :math:`H(X)` and the joint
entropy :math:`H(X, Y)`? Using the definitions above, we can express
this cleanly:
.. math:: H(Y \mid X) = H(X, Y) - H(X).
This has an intuitive interpretation: the information in :math:`Y` given
:math:`X` (:math:`H(Y \mid X)`) is the same as the information in both
:math:`X` and :math:`Y` together (:math:`H(X, Y)`) minus the information
already contained in :math:`X`. This gives us the information in
:math:`Y` which is not also represented in :math:`X`.
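The identity :math:`H(Y \mid X) = H(X, Y) - H(X)` can be checked by substituting the definitions. The short derivation below is a sketch that relies only on the fact that :math:`p_X(x)` is the marginal of :math:`p_{X, Y}(x, y)`, so the expectation of :math:`\log p_X(x)` under the joint distribution equals its expectation under the marginal:

.. math::
\begin{aligned}
H(X, Y) - H(X) &= - E_{(x, y) \sim P} [\log p_{X, Y}(x, y)] + E_{(x, y) \sim P} [\log p_X(x)] \\
&= - E_{(x, y) \sim P} \left[ \log \frac{p_{X, Y}(x, y)}{p_X(x)} \right] \\
&= - E_{(x, y) \sim P} [\log p(y \mid x)] \\
&= H(Y \mid X).
\end{aligned}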
Now, let’s implement conditional entropy :eq:`eq_cond_ent_def` from
scratch in MXNet.
.. code:: python
def conditional_entropy(p_xy, p_x):
    # p_x broadcasts along the last axis, so the columns of p_xy index x
    p_y_given_x = p_xy/p_x
    cond_ent = -p_xy * np.log2(p_y_given_x)
    # nansum skips NaN terms, which arise from 0 * log2(0)
    out = nansum(cond_ent.as_nd_ndarray())
    return out
conditional_entropy(np.array([[0.25, 0.5], [0.25, 0.0]]), np.array([0.5, 0.5]))
.. parsed-literal::
:class: output
[0.5]
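As a cross-check of the relation :math:`H(Y \mid X) = H(X, Y) - H(X)`, the sketch below computes both sides with the standard library only. The joint distribution is an illustrative choice stored as ``{(x, y): probability}``, and the marginal is derived from it so that the two are guaranteed to be consistent.

```python
import math

# Joint p.m.f. as {(x, y): probability}; the values are illustrative.
p_xy = {(0, 0): 0.1, (1, 0): 0.5, (0, 1): 0.1, (1, 1): 0.3}

# Marginal of X obtained by summing the joint over y.
p_x = {}
for (x, _), p in p_xy.items():
    p_x[x] = p_x.get(x, 0.0) + p

h_xy = -sum(p * math.log2(p) for p in p_xy.values())
h_x = -sum(p * math.log2(p) for p in p_x.values())

# Direct definition: -sum_{x, y} p(x, y) log2 p(y | x).
h_y_given_x = -sum(p * math.log2(p / p_x[x]) for (x, _), p in p_xy.items())

print(f"direct: {h_y_given_x:.6f}, via chain rule: {h_xy - h_x:.6f}")
```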
Mutual Information
~~~~~~~~~~~~~~~~~~
Given the previous setting of random variables :math:`(X, Y)`, you may
wonder: “Now that we know how much information is contained in :math:`Y`
but not in :math:`X`, can we similarly ask how much information is
shared between :math:`X` and :math:`Y`?” The answer will be the *mutual
information* of :math:`(X, Y)`, which we will write as :math:`I(X, Y)`.
Rather than diving straight into the formal definition, let’s practice
our intuition by first trying to derive an expression for the mutual
information entirely based on terms we have constructed before. We wish
to find the information shared between two random variables. One way we
could try to do this is to start with all the information contained in
both :math:`X` and :math:`Y` together, and then we take off the parts
that are not shared. The information contained in both :math:`X` and
:math:`Y` together is written as :math:`H(X, Y)`. We want to subtract
from this the information contained in :math:`X` but not in :math:`Y`,
and the information contained in :math:`Y` but not in :math:`X`. As we
saw in the previous section, this is given by :math:`H(X \mid Y)` and
:math:`H(Y \mid X)` respectively. Thus, we have that the mutual
information should be
.. math::
I(X, Y) = H(X, Y) - H(Y \mid X) - H(X \mid Y).
Indeed, this is a valid definition for the mutual information. If we
expand out the definitions of these terms and combine them, a little
algebra shows that this is the same as
.. math:: I(X, Y) = E_{(x, y) \sim P} \left[ \log\frac{p_{X, Y}(x, y)}{p_X(x) p_Y(y)} \right].
:label: eq_mut_ent_def
We can summarize all of these relationships in
:numref:`fig_mutual_information`. It is an excellent test of intuition
to see why the following expressions are all also equivalent to
:math:`I(X, Y)`.
- :math:`H(X) - H(X \mid Y)`
- :math:`H(Y) - H(Y \mid X)`
- :math:`H(X) + H(Y) - H(X, Y)`
.. _fig_mutual_information:
.. figure:: ../img/mutual_information.svg
Mutual information’s relationship with joint entropy and conditional
entropy.
In many ways we can think of the mutual information
:eq:`eq_mut_ent_def` as a principled extension of the correlation
coefficient we saw in :numref:`sec_random_variables`. This allows us
to ask not only for linear relationships between variables, but for the
maximum information shared between the two random variables of any kind.
Now, let’s implement mutual information from scratch.
.. code:: python
def mutual_information(p_xy, p_x, p_y):
    # p_x has shape (k,) and p_y has shape (m, 1), so p_x * p_y broadcasts to
    # the product of the marginals (rows index y, columns index x)
    p = p_xy / (p_x * p_y)
    mutual = p_xy * np.log2(p)
    # nansum skips NaN terms, which arise from 0 * log2(0)
    out = nansum(mutual.as_nd_ndarray())
    return out
mutual_information(np.array([[0.5, 0.0], [0.0, 0.5]]),
                   np.array([0.5, 0.5]),
                   np.array([[0.5], [0.5]]))
.. parsed-literal::
:class: output
[1.]
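The equivalent expressions for :math:`I(X, Y)` can be compared numerically. The sketch below is a standard-library illustration on an arbitrary joint distribution stored as ``{(x, y): probability}``; each conditional entropy is computed from its own definition rather than from the chain rule, so the agreement is a genuine check.

```python
import math

p_xy = {(0, 0): 0.1, (1, 0): 0.5, (0, 1): 0.1, (1, 1): 0.3}

def marginal(joint, axis):
    """Sum the joint p.m.f. over the other variable (axis 0 is x, axis 1 is y)."""
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out

def entropy_bits(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

p_x, p_y = marginal(p_xy, 0), marginal(p_xy, 1)
h_x, h_y = entropy_bits(p_x.values()), entropy_bits(p_y.values())
h_xy = entropy_bits(p_xy.values())
h_y_given_x = -sum(p * math.log2(p / p_x[x]) for (x, _), p in p_xy.items())
h_x_given_y = -sum(p * math.log2(p / p_y[y]) for (_, y), p in p_xy.items())

# Expected log-ratio of the joint to the product of the marginals.
mi_direct = sum(p * math.log2(p / (p_x[x] * p_y[y]))
                for (x, y), p in p_xy.items())
mi_forms = [
    h_xy - h_y_given_x - h_x_given_y,
    h_x - h_x_given_y,
    h_y - h_y_given_x,
    h_x + h_y - h_xy,
]
print(mi_direct, mi_forms)
```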
Properties of Mutual Information
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Rather than memorizing the definition of mutual information
:eq:`eq_mut_ent_def`, you only need to keep in mind its notable
properties:
- Mutual information is symmetric, i.e., :math:`I(X, Y) = I(Y, X)`.
- Mutual information is non-negative, i.e., :math:`I(X, Y) \geq 0`.
- :math:`I(X, Y) = 0` if and only if :math:`X` and :math:`Y` are
independent. For example, if :math:`X` and :math:`Y` are independent,
then knowing :math:`Y` does not give any information about :math:`X`
and vice versa, so their mutual information is zero.
- Alternatively, if :math:`X` is an invertible function of :math:`Y`,
then :math:`Y` and :math:`X` share all information and
.. math:: I(X, Y) = H(Y) = H(X).
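A small numerical illustration of these properties follows. It is a sketch using the standard library only; the three joint distributions are illustrative choices stored as ``{(x, y): probability}``, and ``mutual_information_bits`` is a helper written for this example.

```python
import math

def mutual_information_bits(joint):
    """I(X, Y) in bits for a joint p.m.f. given as {(x, y): probability}."""
    p_x, p_y = {}, {}
    for (x, y), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p
    return sum(p * math.log2(p / (p_x[x] * p_y[y]))
               for (x, y), p in joint.items() if p > 0)

general = {(0, 0): 0.1, (1, 0): 0.5, (0, 1): 0.1, (1, 1): 0.3}
swapped = {(y, x): p for (x, y), p in general.items()}   # roles of X and Y exchanged
independent = {(x, y): px * py
               for x, px in enumerate([0.2, 0.8])
               for y, py in enumerate([0.6, 0.4])}
invertible = {(0, 1): 0.2, (1, 0): 0.8}                  # Y = 1 - X

h_x = -(0.2 * math.log2(0.2) + 0.8 * math.log2(0.8))     # entropy of X above

mi_general = mutual_information_bits(general)
mi_swapped = mutual_information_bits(swapped)
mi_independent = mutual_information_bits(independent)
mi_invertible = mutual_information_bits(invertible)
print(mi_general, mi_swapped, mi_independent, mi_invertible, h_x)
```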
Pointwise Mutual Information
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
When we worked with entropy at the beginning of this section, we were
able to provide an interpretation of :math:`-\log(p_X(x))` as how
*surprised* we were with the particular outcome. We may give a similar
interpretation to the logarithmic term in the mutual information, which
is often referred to as the *pointwise mutual information*:
.. math:: \mathrm{pmi}(x, y) = \log\frac{p_{X, Y}(x, y)}{p_X(x) p_Y(y)}.
:label: eq_pmi_def
We can think of :eq:`eq_pmi_def` as measuring how much more or less
likely the specific combination of outcomes :math:`x` and :math:`y` are
compared to what we would expect for independent random outcomes. If it
is large and positive, then these two specific outcomes occur much more
frequently than they would compared to random chance (*note*: the
denominator is :math:`p_X(x) p_Y(y)`, which is the probability of the two
outcomes occurring together if they were independent), whereas if it is large and negative it
represents the two outcomes happening far less than we would expect by
random chance.
This allows us to interpret the mutual information
:eq:`eq_mut_ent_def` as the average amount that we were surprised
to see two outcomes occurring together compared to what we would expect
if they were independent.
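To make the pointwise view concrete, the sketch below tabulates :math:`\mathrm{pmi}(x, y)` for an illustrative joint distribution and then averages it under the joint to recover the mutual information. It uses the standard library only; the distribution and variable names are choices made for this example.

```python
import math

# Illustrative joint p.m.f. {(x, y): probability} and its marginals.
p_xy = {(0, 0): 0.1, (1, 0): 0.5, (0, 1): 0.1, (1, 1): 0.3}
p_x = {0: 0.2, 1: 0.8}
p_y = {0: 0.6, 1: 0.4}

# Pointwise mutual information of every outcome pair, in bits.
pmi = {(x, y): math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items()}

# Averaging the pmi under the joint distribution gives the mutual information.
mutual_info = sum(p_xy[pair] * pmi[pair] for pair in p_xy)

for pair, value in pmi.items():
    print(f"pmi{pair} = {value:+.4f} bits")
print(f"I(X, Y) = {mutual_info:.6f} bits")
```

Positive entries mark pairs that occur together more often than independence would predict, negative entries mark pairs that occur together less often, and the weighted average is never negative.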
Applications of Mutual Information
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Mutual information may be a little abstract in its pure definition, so
how does it relate to machine learning? In natural language processing,
one of the most difficult problems is the *ambiguity resolution*, or the
issue of the meaning of a word being unclear from context. For example,
recently a headline in the news reported that “Amazon is on fire”. You
may wonder whether the company Amazon has a building on fire, or the
Amazon rain forest is on fire.
In this case, mutual information can help us resolve this ambiguity. We
first find the group of words that each has a relatively large mutual
information with the company Amazon, such as e-commerce, technology, and
online. Second, we find another group of words that each has a
relatively large mutual information with the Amazon rain forest, such as
rain, forest, and tropical. When we need to disambiguate “Amazon”, we
can compare which group has more occurrence in the context of the word
Amazon. In this case the article would go on to describe the forest, and
make the context clear.
Kullback–Leibler Divergence
---------------------------
As we have discussed in :numref:`sec_linear-algebra`, we can use
norms to measure distance between two points in space of any
dimensionality. We would like to be able to do a similar task with
probability distributions. There are many ways to go about this, but
information theory provides one of the nicest. We now explore the
*Kullback–Leibler (KL) divergence*, which provides a way to measure if
two distributions are close together or not.
Definition
~~~~~~~~~~
Consider a random variable :math:`X` that follows the probability
distribution :math:`P` with a p.d.f. or a p.m.f. :math:`p(x)`, and suppose we
estimate :math:`P` by another probability distribution :math:`Q` with a
p.d.f. or a p.m.f. :math:`q(x)`. Then the *Kullback–Leibler (KL)
divergence* (or *relative entropy*) between :math:`P` and :math:`Q` is
.. math:: D_{\mathrm{KL}}(P\|Q) = E_{x \sim P} \left[ \log \frac{p(x)}{q(x)} \right].
:label: eq_kl_def
As with the pointwise mutual information :eq:`eq_pmi_def`, we can
again provide an interpretation of the logarithmic term:
:math:`-\log \frac{q(x)}{p(x)} = -\log(q(x)) - (-\log(p(x)))` will be
large and positive if we see :math:`x` far more often under :math:`P`
than we would expect for :math:`Q`, and large and negative if we see the
outcome far less than expected. In this way, we can interpret it as our
*relative* surprise at observing the outcome compared to how surprised
we would be observing it from our reference distribution.
In MXNet, let’s implement the KL divergence from scratch.
.. code:: python
def kl_divergence(p, q):
    kl = p * np.log2(p / q)
    # nansum skips NaN terms (e.g., from 0 * log2(0))
    out = nansum(kl.as_nd_ndarray())
    return out.abs().asscalar()
KL Divergence Properties
~~~~~~~~~~~~~~~~~~~~~~~~
Let’s take a look at some properties of the KL divergence
:eq:`eq_kl_def`.
- KL divergence is non-symmetric, i.e., there are :math:`P, Q` such that
.. math:: D_{\mathrm{KL}}(P\|Q) \neq D_{\mathrm{KL}}(Q\|P).
- KL divergence is non-negative, i.e.,
.. math:: D_{\mathrm{KL}}(P\|Q) \geq 0.
\ Note that the equality holds only when :math:`P = Q`.
- If there exists an :math:`x` such that :math:`p(x) > 0` and
:math:`q(x) = 0`, then :math:`D_{\mathrm{KL}}(P\|Q) = \infty`.
- There is a close relationship between KL divergence and mutual
information. Besides the relationship shown in
:numref:`fig_mutual_information`, :math:`I(X, Y)` is also
numerically equivalent with the following terms:
1. :math:`D_{\mathrm{KL}}(P(X, Y) \ \| \ P(X)P(Y))`;
2. :math:`E_Y \{ D_{\mathrm{KL}}(P(X \mid Y) \ \| \ P(X)) \}`;
3. :math:`E_X \{ D_{\mathrm{KL}}(P(Y \mid X) \ \| \ P(Y)) \}`.
For the first term, we interpret mutual information as the KL
divergence between :math:`P(X, Y)` and the product of :math:`P(X)`
and :math:`P(Y)`, and thus is a measure of how different the joint
distribution is from the distribution if they were independent. For
the second term, mutual information tells us the average reduction in
uncertainty about :math:`X` that results from learning the value of
:math:`Y`. The third term has the same interpretation with the roles
of :math:`X` and :math:`Y` exchanged.
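The properties above can be illustrated on small discrete distributions. The following is a minimal sketch using the standard library only; ``kl_divergence_bits`` is a helper written for this example (it assumes both arguments are valid p.m.f.s over the same outcomes), and all numbers are illustrative.

```python
import math

def kl_divergence_bits(p, q):
    """D_KL(P || Q) in bits for two discrete distributions on the same outcomes."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue              # the term 0 * log(0 / q) is taken to be 0
        if qi == 0:
            return math.inf       # P puts mass where Q has none
        total += pi * math.log2(pi / qi)
    return total

p = [0.5, 0.5]
q = [0.9, 0.1]
kl_pq = kl_divergence_bits(p, q)
kl_qp = kl_divergence_bits(q, p)
kl_pp = kl_divergence_bits(p, p)
kl_unsupported = kl_divergence_bits(p, [1.0, 0.0])
print(kl_pq, kl_qp, kl_pp, kl_unsupported)

# Mutual information as the KL divergence between a joint distribution and
# the product of its marginals (p_x = [0.2, 0.8], p_y = [0.6, 0.4]).
joint = [0.1, 0.5, 0.1, 0.3]
product = [0.2 * 0.6, 0.8 * 0.6, 0.2 * 0.4, 0.8 * 0.4]
mutual_info = kl_divergence_bits(joint, product)
print(mutual_info)
```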
Example
~~~~~~~
Let’s go through a toy example to see the non-symmetry explicitly.
First, let’s generate and sort three ``ndarray``\ s of length
:math:`10,000`: an objective ``ndarray`` :math:`p` which follows a
normal distribution :math:`N(0, 1)`, and two candidate ``ndarray``\ s
:math:`q_1` and :math:`q_2` which follow normal distributions
:math:`N(-1, 1)` and :math:`N(1, 1)` respectively.
.. code:: python
random.seed(1)
nd_length = 10000
p = np.random.normal(loc=0, scale=1, size=(nd_length, ))
q1 = np.random.normal(loc=-1, scale=1, size=(nd_length, ))
q2 = np.random.normal(loc=1, scale=1, size=(nd_length, ))
p = np.array(sorted(p.asnumpy()))
q1 = np.array(sorted(q1.asnumpy()))
q2 = np.array(sorted(q2.asnumpy()))
Since :math:`q_1` and :math:`q_2` are symmetric with respect to the
y-axis (i.e., :math:`x=0`), we expect a similar value of KL divergence
between :math:`D_{\mathrm{KL}}(p\|q_1)` and
:math:`D_{\mathrm{KL}}(p\|q_2)`. As you can see below,
:math:`D_{\mathrm{KL}}(p\|q_1)` and :math:`D_{\mathrm{KL}}(p\|q_2)`
differ by less than 3%.
.. code:: python
kl_pq1 = kl_divergence(p, q1)
kl_pq2 = kl_divergence(p, q2)
similar_percentage = abs(kl_pq1 - kl_pq2) / ((kl_pq1 + kl_pq2) / 2) * 100
kl_pq1, kl_pq2, similar_percentage
.. parsed-literal::
:class: output
(8470.638, 8664.999, 2.268504302642314)
In contrast, you may find that :math:`D_{\mathrm{KL}}(q_2 \|p)` and
:math:`D_{\mathrm{KL}}(p \| q_2)` differ substantially, by around 40% as
shown below.
.. code:: python
kl_q2p = kl_divergence(q2, p)
differ_percentage = abs(kl_q2p - kl_pq2) / ((kl_q2p + kl_pq2) / 2) * 100
kl_q2p, differ_percentage
.. parsed-literal::
:class: output
(13536.835, 43.88678828000115)
Cross Entropy
-------------
If you are curious about applications of information theory in deep
learning, here is a quick example. We define the true distribution
:math:`P` with probability distribution :math:`p(x)`, and the estimated
distribution :math:`Q` with probability distribution :math:`q(x)`, and
we will use them in the rest of this section.
Say we need to solve a binary classification problem based on given
:math:`n` data points {:math:`x_1, \ldots, x_n`}. Assume that we encode
:math:`1` and :math:`0` as the positive and negative class label
:math:`y_i` respectively, and our neural network is parameterized by
:math:`\theta`. If we aim to find the best :math:`\theta` so that
:math:`\hat{y}_i= p_{\theta}(y_i \mid x_i)`, it is natural to apply the
maximum log-likelihood approach as was seen in
:numref:`sec_maximum_likelihood`. To be specific, for true labels
:math:`y_i` and predictions :math:`\hat{y}_i= p_{\theta}(y_i \mid x_i)`,
the probability to be classified as positive is
:math:`\pi_i= p_{\theta}(y_i = 1 \mid x_i)`. Hence, the log-likelihood
function would be
.. math::
\begin{aligned}
l(\theta) &= \log L(\theta) \\
&= \log \prod_{i=1}^n \pi_i^{y_i} (1 - \pi_i)^{1 - y_i} \\
&= \sum_{i=1}^n y_i \log(\pi_i) + (1 - y_i) \log (1 - \pi_i). \\
\end{aligned}
Maximizing the log-likelihood function :math:`l(\theta)` is identical to
minimizing :math:`- l(\theta)`, and hence we can find the best
:math:`\theta` from here. To generalize the above loss to any
distributions, we also call :math:`-l(\theta)` the *cross entropy
loss* :math:`\mathrm{CE}(y, \hat{y})`, where :math:`y` follows the true
distribution :math:`P` and :math:`\hat{y}` follows the estimated
distribution :math:`Q`.
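The binary log-likelihood above translates almost line by line into code. The sketch below is illustrative: the labels and the predicted probabilities ``pi`` are made-up values standing in for the outputs of a model, and the natural logarithm is used, so the loss is in nats (replacing ``math.log`` with ``math.log2`` would give bits).

```python
import math

def binary_cross_entropy(labels, probs):
    """Negative log-likelihood -l(theta) for 0/1 labels and predicted P(y = 1)."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(labels, probs))

labels = [1, 0, 1, 1]
pi = [0.9, 0.2, 0.7, 0.6]   # hypothetical predicted probabilities of class 1

loss = binary_cross_entropy(labels, pi)

# The same quantity via the likelihood L(theta) = prod pi^y (1 - pi)^(1 - y).
likelihood = math.prod(p if y == 1 else 1 - p for y, p in zip(labels, pi))
print(loss, -math.log(likelihood))
```

Both routes give the same number, which is the sense in which minimizing the loss and maximizing the likelihood coincide.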
This was all derived by working from the maximum likelihood point of
view. However, if we look closely we can see that terms like
:math:`\log(\pi_i)` have entered into our computation which is a solid
indication that we can understand the expression from an information
theoretic point of view.
Formal Definition
~~~~~~~~~~~~~~~~~
Like KL divergence, for a random variable :math:`X`, we can also measure
the divergence between the estimating distribution :math:`Q` and the
true distribution :math:`P` via *cross entropy*,
.. math:: \mathrm{CE}(P, Q) = - E_{x \sim P} [\log(q(x))].
:label: eq_ce_def
By using properties of entropy discussed above, we can also interpret it
as the summation of the entropy :math:`H(P)` and the KL divergence
between :math:`P` and :math:`Q`, i.e.,
.. math:: \mathrm{CE} (P, Q) = H(P) + D_{\mathrm{KL}}(P\|Q).
In MXNet, we can implement the cross entropy loss as below.
.. code:: python
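def cross_entropy(y_hat, y):
    # Pick the predicted probability of the true class for each example;
    # np.log is the natural logarithm, so this loss is measured in nats
    ce = -np.log(y_hat[range(len(y_hat)), y])
    return ce.mean()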
Now define two ``ndarray``\ s for the labels and predictions, and
calculate the cross entropy loss of them.
.. code:: python
labels = np.array([0, 2])
preds = np.array([[0.3, 0.6, 0.1], [0.2, 0.3, 0.5]])
cross_entropy(preds, labels)
.. parsed-literal::
:class: output
array(0.94856)
Properties
~~~~~~~~~~
As alluded to at the beginning of this section, cross entropy
:eq:`eq_ce_def` can be used to define a loss function in the
optimization problem. It turns out that the following are equivalent:
1. Maximizing predictive probability of :math:`Q` for distribution
:math:`P`, (i.e., :math:`E_{x \sim P} [\log (q(x))]`);
2. Minimizing cross entropy :math:`\mathrm{CE} (P, Q)`;
3. Minimizing the KL divergence :math:`D_{\mathrm{KL}}(P\|Q)`.
The definition of cross entropy indirectly proves the equivalent
relationship between objective 2 and objective 3, as long as the entropy
of true data :math:`H(P)` is constant.
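The equivalence of objectives 2 and 3 can be seen numerically: for a fixed :math:`P`, the cross entropy and the KL divergence differ by the same constant :math:`H(P)` for every candidate :math:`Q`, so they rank candidates identically. The sketch below uses the standard library only, with an arbitrary :math:`P` and three hypothetical candidates.

```python
import math

def entropy_bits(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy_bits(p, q):
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence_bits(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.1, 0.5, 0.1, 0.3]   # fixed "true" distribution P
candidates = {
    "uniform": [0.25, 0.25, 0.25, 0.25],
    "close": [0.15, 0.45, 0.1, 0.3],
    "far": [0.7, 0.1, 0.1, 0.1],
}

h_p = entropy_bits(p)
gaps = {name: cross_entropy_bits(p, q) - kl_divergence_bits(p, q)
        for name, q in candidates.items()}
best_by_ce = min(candidates, key=lambda name: cross_entropy_bits(p, candidates[name]))
best_by_kl = min(candidates, key=lambda name: kl_divergence_bits(p, candidates[name]))
print(h_p, gaps)
print(best_by_ce, best_by_kl)
```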
Cross Entropy as An Objective Function of Multi-class Classification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If we dive deep into the classification objective function with cross
entropy loss :math:`\mathrm{CE}`, we will find minimizing
:math:`\mathrm{CE}` is equivalent to maximizing the log-likelihood
function :math:`l`.
To begin with, suppose that we are given a dataset with :math:`n`
samples, and it can be classified into :math:`k`-classes. For each data
point :math:`i`, we represent any :math:`k`-class label
:math:`\mathbf{y}_i = (y_{i1}, \ldots, y_{ik})` by *one-hot encoding*.
To be specific, if the data point :math:`i` belongs to class :math:`j`,
then we set the :math:`j`-th entry to :math:`1`, and all other
components to :math:`0`, i.e.,
.. math:: y_{ij} = \begin{cases}1 & \text{if data point } i \text{ belongs to class } j; \\ 0 &\text{otherwise.}\end{cases}
For instance, if a multi-class classification problem contains three
classes :math:`A`, :math:`B`, and :math:`C`, then the labels
:math:`\mathbf{y}_i` can be encoded as
{:math:`A: (1, 0, 0); B: (0, 1, 0); C: (0, 0, 1)`}.
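A one-hot encoding of this kind can be produced by indexing the rows of
an identity matrix. The sketch below assumes plain NumPy (imported as
``onp``) and a hypothetical helper ``one_hot``. The integer labels 0, 1,
and 2 stand in for the classes :math:`A`, :math:`B`, and :math:`C`.

.. code:: python

   import numpy as onp  # plain NumPy (an assumption; the section uses mxnet.np)

   def one_hot(labels, k):
       """Map integer class labels in {0, ..., k-1} to one-hot rows."""
       # Row j of the k x k identity matrix is the one-hot vector for class j
       return onp.eye(k, dtype=int)[labels]

   classes = ['A', 'B', 'C']
   encoded = one_hot(onp.array([0, 1, 2]), k=len(classes))
   for name, row in zip(classes, encoded):
       print(name, row)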
Assume that our neural network is parameterized by :math:`\theta`. For
a true label vector :math:`\mathbf{y}_i`, the prediction, i.e., the
probability that the network assigns to the true label, is
.. math:: \hat{\mathbf{y}}_i= p_{\theta}(\mathbf{y}_i \mid \mathbf{x}_i) = \sum_{j=1}^k y_{ij} p_{\theta} (y_{ij} \mid \mathbf{x}_i).
Hence, the *cross entropy loss* would be
.. math::
\mathrm{CE}(\mathbf{y}, \hat{\mathbf{y}}) = - \sum_{i=1}^n \mathbf{y}_i \log \hat{\mathbf{y}}_i
= - \sum_{i=1}^n \sum_{j=1}^k y_{ij} \log{p_{\theta} (y_{ij} \mid \mathbf{x}_i)}.\\
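To make the double sum concrete, here is a small sketch, assuming plain
NumPy (imported as ``onp``) and the same toy labels and predictions as
in the earlier example. The function name ``cross_entropy_one_hot`` is
hypothetical. Because each :math:`\mathbf{y}_i` is one-hot, the inner
sum keeps only the log-probability of the true class.

.. code:: python

   import numpy as onp  # plain NumPy (an assumption; the section uses mxnet.np)

   def cross_entropy_one_hot(y, y_hat):
       """Summed cross entropy: -sum_i sum_j y_ij * log(y_hat_ij)."""
       return -onp.sum(y * onp.log(y_hat))

   y = onp.array([[1, 0, 0], [0, 0, 1]])                  # one-hot labels 0 and 2
   y_hat = onp.array([[0.3, 0.6, 0.1], [0.2, 0.3, 0.5]])  # predicted probabilities

   total = cross_entropy_one_hot(y, y_hat)
   # Same value obtained by picking the true-class probabilities directly
   picked = -onp.log(y_hat[onp.arange(2), [0, 2]]).sum()
   print(total, picked, total / len(y))

Dividing the sum by :math:`n` recovers the mean that ``cross_entropy``
returned for the same inputs earlier.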
On the other hand, we can also approach the problem through maximum
likelihood estimation. To begin with, let’s quickly introduce the
:math:`k`-class multinoulli distribution. It is an extension of the
Bernoulli distribution from binary class to multi-class. If a random
variable :math:`\mathbf{z} = (z_{1}, \ldots, z_{k})` follows a
:math:`k`-class *multinoulli distribution* with probabilities
:math:`\mathbf{p} = (p_{1}, \ldots, p_{k})`, i.e.,
.. math:: p(\mathbf{z}) = p(z_1, \ldots, z_k) = \mathrm{Multi} (p_1, \ldots, p_k), \text{ where } \sum_{i=1}^k p_i = 1,
then the joint probability mass function (p.m.f.) of :math:`\mathbf{z}`
is
.. math:: \mathbf{p}^\mathbf{z} = \prod_{j=1}^k p_{j}^{z_{j}}.
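As an illustration of this p.m.f., the sketch below evaluates
:math:`\prod_{j=1}^k p_{j}^{z_{j}}` for every possible one-hot outcome.
It assumes plain NumPy (imported as ``onp``), a hypothetical helper
``multinoulli_pmf``, and made-up class probabilities. The product simply
selects the probability of the class that :math:`\mathbf{z}` marks.

.. code:: python

   import numpy as onp  # plain NumPy (an assumption; the section uses mxnet.np)

   def multinoulli_pmf(z, p):
       """Evaluate prod_j p_j ** z_j for a one-hot outcome z."""
       return onp.prod(p ** z)

   p = onp.array([0.2, 0.5, 0.3])    # hypothetical class probabilities, sum to 1
   outcomes = onp.eye(3, dtype=int)  # the three possible one-hot outcomes
   pmf_values = [multinoulli_pmf(z, p) for z in outcomes]
   print(pmf_values)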
It can be seen that each data point :math:`\mathbf{y}_i` follows a
:math:`k`-class multinoulli distribution with probabilities
:math:`\boldsymbol{\pi} = (\pi_{1}, \ldots, \pi_{k})`. Therefore, the
joint p.m.f. of each data point :math:`\mathbf{y}_i` is
:math:`\boldsymbol{\pi}^{\mathbf{y}_i} = \prod_{j=1}^k \pi_{j}^{y_{ij}}`.
Hence, the log-likelihood function would be
.. math::
\begin{aligned}
l(\theta)
= \log L(\theta)
= \log \prod_{i=1}^n \boldsymbol{\pi}^{\mathbf{y}_i}
= \log \prod_{i=1}^n \prod_{j=1}^k \pi_{j}^{y_{ij}}
= \sum_{i=1}^n \sum_{j=1}^k y_{ij} \log{\pi_{j}}.\\
\end{aligned}
In maximum likelihood estimation, we maximize the objective function
:math:`l(\theta)` with
:math:`\pi_{j} = p_{\theta} (y_{ij} \mid \mathbf{x}_i)`. Under this
substitution, :math:`l(\theta)` is exactly the negative of the cross
entropy loss derived above. Therefore, for any multi-class
classification problem, maximizing the above log-likelihood function
:math:`l(\theta)` is equivalent to minimizing the CE loss
:math:`\mathrm{CE}(\mathbf{y}, \hat{\mathbf{y}})`.
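A quick numerical check of this equivalence can be sketched with plain
NumPy (imported as ``onp``, an assumption) and the toy labels and
predictions from the earlier example. The log of the likelihood, which
is a product of per-example p.m.f. values, should equal the negative of
the summed cross entropy.

.. code:: python

   import numpy as onp  # plain NumPy (an assumption; the section uses mxnet.np)

   y = onp.array([[1, 0, 0], [0, 0, 1]])                  # one-hot labels
   probs = onp.array([[0.3, 0.6, 0.1], [0.2, 0.3, 0.5]])  # p_theta(. | x_i)

   # Likelihood: product over examples i of prod_j pi_j ** y_ij
   likelihood = onp.prod(probs ** y)
   log_likelihood = onp.log(likelihood)

   # Summed cross entropy: -sum_i sum_j y_ij * log(pi_j)
   ce_sum = -onp.sum(y * onp.log(probs))

   print(log_likelihood, -ce_sum)

For long datasets the raw product underflows quickly, which is one
practical reason to work with the sum of log-probabilities instead.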
To check the above derivation numerically, let’s apply the built-in
measure ``NegativeLogLikelihood`` in MXNet. Using the same ``labels``
and ``preds`` as in the earlier example, we get the same numerical loss
as before, up to five decimal places.
.. code:: python
nll_loss = NegativeLogLikelihood()
nll_loss.update(labels.as_nd_ndarray(), preds.as_nd_ndarray())
nll_loss.get()
.. parsed-literal::
:class: output
('nll-loss', 0.9485599994659424)
Summary
-------
- Information theory is a field of study about encoding, decoding,
transmitting, and manipulating information.
- Entropy measures how much information is present in a signal.
- KL divergence measures the difference between two distributions.
- Cross entropy can be viewed as an objective function of multi-class
  classification. Minimizing the cross entropy loss is equivalent to
  maximizing the log-likelihood function.
Exercises
---------
1. Verify that the card examples from the first section indeed have the
claimed entropy.
2. Let’s compute the entropy from a few data sources:
- Assume that you are watching the output generated by a monkey at a
typewriter. The monkey presses any of the :math:`44` keys of the
typewriter at random (you can assume that it has not discovered
any special keys or the shift key yet). How many bits of
randomness per character do you observe?
   - Being unhappy with the monkey, you replace it with a drunk
     typesetter. It is able to generate words, albeit not coherently.
     Instead, it picks a random word out of a vocabulary of
     :math:`2,000` words. Moreover, assume that the average length of a
     word is :math:`4.5` letters in English. How many bits of
     randomness per character do you observe now?
   - Still being unhappy with the result, you replace the typesetter with
     a high-quality language model. These can currently obtain
     perplexity numbers as low as :math:`15` points per character. The
     perplexity is defined as the inverse of a length-normalized
     probability, i.e.,

     .. math:: PPL(x) = \left[p(x)\right]^{-1 / \text{length(x)} }.

     How many bits of randomness do you observe now?
3. Explain intuitively why :math:`I(X, Y) = H(X) - H(X|Y)`. Then, show
this is true by expressing both sides as an expectation with respect
to the joint distribution.
4. What is the KL divergence between the two Gaussian distributions
:math:`\mathcal{N}(\mu_1, \sigma_1^2)` and
:math:`\mathcal{N}(\mu_2, \sigma_2^2)`?