16.8. Distributions

Now that we have learned how to work with probability theory in both the discrete and the continuous setting, let's get to know some of the common random distributions encountered. Depending on the area of machine learning we are working in, we may need to be familiar with vastly more of these, or for some areas of deep learning potentially none at all. This is, however, a good basic list to be familiar with. Let's first import some common libraries.

%matplotlib inline
import d2l
from IPython import display
from math import erf, factorial
import numpy as np

16.8.1. Bernoulli

This is the simplest random variable usually encountered. This is the random variable that encodes a coin flip which comes up \(1\) with probability \(p\) and \(0\) with probability \(1-p\). If we have a random variable with this distribution, we will write

(16.8.1)\[X \sim \mathrm{Bernoulli}(p).\]

The cumulative distribution function is

(16.8.2)\[\begin{split}F(x) = \begin{cases} 0 & x < 0, \\ 1-p & 0 \le x < 1, \\ 1 & x \ge 1 . \end{cases}\end{split}\]

The probability mass function is plotted below.

p = 0.3

d2l.set_figsize()
d2l.plt.stem([0, 1], [1 - p, p], use_line_collection=True)
d2l.plt.xlabel('x')
d2l.plt.ylabel('p.m.f.')
d2l.plt.show()

Now, let us plot the cumulative distribution function (16.8.2).

x = np.arange(-1, 2, 0.01)

def F(x):
    return 0 if x < 0 else 1 if x >= 1 else 1 - p

d2l.plot(x, np.array([F(y) for y in x]), 'x', 'c.d.f.')

If \(X \sim \mathrm{Bernoulli}(p)\), then:

  • \(\mu_X = p\),

  • \(\sigma_X^2 = p(1-p)\).

We can sample an array of arbitrary shape from a Bernoulli random variable as follows.

1*(np.random.rand(10, 10) < p)
array([[1, 0, 1, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [1, 1, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 1, 1, 1, 0, 1, 1, 1, 1],
       [1, 0, 0, 0, 0, 1, 1, 0, 1, 1],
       [1, 1, 1, 0, 0, 0, 0, 0, 1, 0],
       [0, 1, 0, 0, 1, 0, 0, 0, 0, 0],
       [1, 0, 0, 1, 0, 0, 0, 0, 0, 0],
       [1, 1, 0, 0, 0, 0, 1, 0, 1, 0]])
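
As a quick sanity check, we might compare the empirical mean and variance of a large Bernoulli sample with the formulas above; this is a minimal sketch, assuming a sample of size 100000 is large enough to make the estimates stable.

# Empirical mean and variance should be close to p and p*(1 - p)
samples = 1*(np.random.rand(100000) < p)
print(samples.mean(), p)
print(samples.var(), p * (1 - p))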

16.8.2. Discrete Uniform

The next random variable encountered is a discrete uniform distribution. For our discussion here, we will assume that it is supported on the integers \(\{1,2,\ldots, n\}\), although any other set of values can be chosen. The meaning of the word uniform in this context is that every possible value is equally likely. The probability for each value \(i \in \{1,2,3,\ldots,n\}\) is \(p_i = \frac{1}{n}\). We will denote this relationship as

(16.8.3)\[X \sim \mathrm{Uniform}(n).\]

The cumulative distribution function is

(16.8.4)\[\begin{split}F(x) = \begin{cases} 0 & x < 1, \\ \frac{k}{n} & k \le x < k+1 \text{ with } 1 \le k < n, \\ 1 & x \ge n . \end{cases}\end{split}\]

Let us first plot the probability mass function.

n = 5

d2l.plt.stem([i+1 for i in range(n)], n*[1 / n], use_line_collection=True)
d2l.plt.xlabel('x')
d2l.plt.ylabel('p.m.f.')
d2l.plt.show()

Now, let us plot the cumulative distribution function (16.8.4).

x = np.arange(-1, 6, 0.01)

def F(x):
    return 0 if x < 1 else 1 if x > n else np.floor(x) / n

d2l.plot(x, np.array([F(y) for y in x]), 'x', 'c.d.f.')

If \(X \sim \mathrm{Uniform}(n)\), then:

  • \(\mu_X = \frac{1+n}{2}\),

  • \(\sigma_X^2 = \frac{n^2-1}{12}\).

We can sample an array of arbitrary shape from a discrete uniform random variable as follows. Note that np.random.randint excludes its upper bound, so we pass n + 1 to obtain values up to and including \(n\).

np.random.randint(1, n + 1, size=(10, 10))
array([[1, 4, 4, 1, 4, 2, 5, 4, 3, 1],
       [4, 4, 1, 4, 3, 4, 3, 5, 4, 4],
       [1, 4, 3, 3, 4, 1, 5, 3, 2, 2],
       [4, 3, 1, 1, 2, 4, 5, 5, 2, 5],
       [2, 2, 2, 4, 1, 4, 1, 2, 1, 4],
       [5, 4, 3, 2, 5, 1, 3, 1, 5, 1],
       [1, 2, 4, 1, 1, 3, 4, 4, 3, 5],
       [4, 1, 1, 2, 4, 4, 5, 2, 4, 3],
       [2, 4, 4, 1, 2, 3, 5, 2, 3, 5],
       [4, 1, 3, 2, 5, 5, 1, 1, 3, 3]])
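
As noted above, the values need not be \(1, 2, \ldots, n\); a uniform choice over any finite set can be drawn with np.random.choice. The set of values below is purely illustrative.

# A discrete uniform draw over an arbitrary finite set of values
values = [-2, 0.5, 3, 7]
np.random.choice(values, size=(2, 5))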

16.8.3. Continuous Uniform

Next let us discuss the continuous uniform distribution. The idea behind this random variable is that if we increase the \(n\) in the previous distribution and then scale it to fit within the interval \([a,b]\), we will approach a continuous random variable that picks an arbitrary value in \([a,b]\), all equally likely. We will denote this distribution as

(16.8.5)\[X \sim \mathrm{Uniform}([a,b]).\]

The probability density function is

(16.8.6)\[\begin{split}p(x) = \begin{cases} \frac{1}{b-a} & x \in [a,b], \\ 0 & x \not\in [a,b].\end{cases}\end{split}\]

The cumulative distribution function is

(16.8.7)\[\begin{split}F(x) = \begin{cases} 0 & x < a, \\ \frac{x-a}{b-a} & x \in [a,b], \\ 1 & x \ge b . \end{cases}\end{split}\]

Let us first plot the probability density function (16.8.6).

a, b = 1, 3

x = np.arange(0, 4, 0.01)
p = (x > a)*(x < b)/(b - a)

d2l.plot(x, p, 'x', 'p.d.f.')

Now, let us plot the cumulative distribution function (16.8.7).

def F(x):
    return 0 if x < a else 1 if x > b else (x - a) / (b - a)

d2l.plot(x, np.array([F(y) for y in x]), 'x', 'c.d.f.')

If \(X \sim \mathrm{Uniform}([a,b])\), then:

  • \(\mu_X = \frac{a+b}{2}\),

  • \(\sigma_X^2 = \frac{(b-a)^2}{12}\).

We can sample an array of arbitrary shape from a uniform random variable as follows. Note that np.random.rand samples from \(\mathrm{Uniform}([0,1])\) by default, so if we want a different range we need to scale and shift it.

(b - a) * np.random.rand(10, 10) + a
array([[1.03622044, 2.9476005 , 2.56464539, 1.99988328, 1.62031464,
        2.9923809 , 2.36948021, 1.64962213, 2.56086417, 1.11998753],
       [1.14697106, 2.8539198 , 2.57145819, 2.01444509, 2.52837295,
        2.95049374, 1.57990296, 1.70432074, 2.06137278, 1.02643011],
       [2.05724444, 2.07694326, 2.94231307, 1.03616165, 2.84953272,
        2.82310637, 1.5755503 , 1.41007716, 2.74547407, 2.95507107],
       [1.64575348, 2.86748828, 1.21879665, 1.66444659, 2.38646705,
        1.36990201, 2.603569  , 1.26678044, 1.56463659, 1.91539985],
       [2.1317415 , 2.20564835, 2.10959752, 1.61564938, 1.56553141,
        2.90914743, 2.04552581, 1.65692261, 1.74343444, 1.99362928],
       [1.60010001, 1.87087855, 2.18150088, 1.18565005, 2.53669835,
        1.58175251, 1.70811565, 1.86468444, 1.78191379, 2.99486308],
       [1.20757821, 1.68280768, 1.42473233, 2.53642348, 1.14844189,
        1.0016402 , 1.31976449, 1.57837876, 2.85078786, 2.72461176],
       [2.38766633, 1.26108685, 2.61099439, 2.74903853, 2.10584575,
        1.24654815, 2.72772648, 1.66928316, 2.52663966, 2.80083426],
       [1.67815467, 1.80735666, 2.21306342, 1.71799946, 2.50494323,
        1.26065578, 2.46680196, 2.39411835, 1.69289933, 1.31373398],
       [1.28360239, 2.21406619, 2.52224219, 2.28987029, 1.0406941 ,
        1.3459523 , 2.44969644, 2.4510143 , 1.36135503, 1.71842632]])
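
Alternatively, numpy's np.random.uniform performs the scaling and shifting for us; the following is a sketch of an equivalent call.

# Directly sample Uniform([a, b]) without manual scaling
np.random.uniform(a, b, size=(10, 10))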

16.8.4. Binomial

Let us make things a little more complex and examine the binomial random variable. This random variable originates from performing a sequence of \(n\) independent experiments, each of which has probability \(p\) of succeeding, and asking how many successes we expect to see.

Let us express this mathematically. Each experiment is an independent random variable \(X_i\) where we will use \(1\) to encode success, and \(0\) to encode failure. Since each is an independent coin flip which is successful with probability \(p\), we can say that \(X_i \sim \mathrm{Bernoulli}(p)\). Then, the binomial random variable is

(16.8.8)\[X = \sum_{i=1}^n X_i.\]

In this case, we will write

(16.8.9)\[X \sim \mathrm{Binomial}(n,p).\]

To get the cumulative distribution function, we need to notice that getting exactly \(k\) successes can occur in \(\binom{n}{k} = \frac{n!}{k!(n-k)!}\) ways, each of which has a probability of \(p^k(1-p)^{n-k}\) of occurring. Thus the cumulative distribution function is

(16.8.10)\[\begin{split}F(x) = \begin{cases} 0 & x < 0, \\ \sum_{m = 0}^{k} \binom{n}{m} p^m(1-p)^{n-m} & k \le x < k+1 \text{ with } 0 \le k < n, \\ 1 & x \ge n . \end{cases}\end{split}\]

Let us first plot the probability mass function.

n, p = 10, 0.2

# Compute binomial coefficient
def binom(n, k):
    comb = 1
    for i in range(min(k, n - k)):
        comb = comb * (n - i) // (i + 1)
    return comb

pmf = np.array([p**i * (1-p)**(n - i) * binom(n, i) for i in range(n + 1)])

d2l.plt.stem([i for i in range(n + 1)], pmf, use_line_collection=True)
d2l.plt.xlabel('x')
d2l.plt.ylabel('p.m.f.')
d2l.plt.show()

Now, let us plot the cumulative distribution function (16.8.10).

x = np.arange(-1, 11, 0.01)
cmf = np.cumsum(pmf)

def F(x):
    return 0 if x < 0 else 1 if x > n else cmf[int(x)]

d2l.plot(x, np.array([F(y) for y in x.tolist()]), 'x', 'c.d.f.')

While this result is not simple, the mean and variance are. If \(X \sim \mathrm{Binomial}(n,p)\), then:

  • \(\mu_X = np\),

  • \(\sigma_X^2 = np(1-p)\).

This can be sampled as follows.

np.random.binomial(n, p, size=(10, 10))
array([[2, 3, 2, 1, 0, 1, 3, 0, 4, 2],
       [2, 4, 1, 0, 3, 4, 3, 2, 2, 1],
       [2, 2, 2, 2, 2, 1, 1, 3, 3, 3],
       [1, 2, 0, 3, 0, 0, 1, 3, 2, 3],
       [3, 1, 0, 1, 2, 1, 3, 2, 3, 1],
       [4, 3, 4, 2, 2, 1, 0, 1, 1, 1],
       [1, 2, 0, 4, 1, 2, 3, 3, 3, 3],
       [2, 2, 2, 6, 3, 3, 3, 3, 3, 0],
       [4, 1, 4, 3, 2, 3, 1, 2, 2, 1],
       [1, 3, 3, 3, 1, 3, 2, 0, 3, 2]])
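
To connect this back to (16.8.8), we might also build binomial samples by summing \(n\) independent Bernoulli draws and check that the empirical mean and variance are close to \(np\) and \(np(1-p)\); a minimal sketch with the same \(n\) and \(p\) as above.

# Binomial samples as sums of n independent Bernoulli(p) draws
samples = (np.random.rand(100000, n) < p).sum(axis=1)
print(samples.mean(), n * p)
print(samples.var(), n * p * (1 - p))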

16.8.5. Poisson

Let us now perform a thought experiment. Let us say we are standing at a bus stop and we want to know how many buses will arrive in the next minute. Let's start by considering \(X^{(1)} \sim \mathrm{Bernoulli}(p)\), which simply indicates whether a bus arrives in the one-minute window, with \(p\) the probability that it does. For bus stops far from an urban center, this might be a pretty good approximation, since we will never see more than one bus at a time.

However, if we are in a busy area, it is possible or even likely that two buses will arrive. We can model this by splitting our random variable into two parts: one for the first 30 seconds and one for the second 30 seconds. In this case we can write

(16.8.11)\[X^{(2)} \sim X^{(2)}_1 + X^{(2)}_2,\]

where \(X^{(2)}\) is the total sum, and \(X^{(2)}_i \sim \mathrm{Bernoulli}(p/2)\). The total distribution is then \(X^{(2)} \sim \mathrm{Binomial}(2,p/2)\).

Why stop here? Let us continue to split that minute into \(n\) parts. By the same reasoning as above, we see that

(16.8.12)\[X^{(n)} \sim \mathrm{Binomial}(n,p/n).\]

Let us consider these random variables. By the previous section, we know that this has mean \(\mu_{X^{(n)}} = n(p/n) = p\), and variance \(\sigma_{X^{(n)}}^2 = n(p/n)(1-(p/n)) = p(1-p/n)\). If we take \(n \rightarrow \infty\), we can see that these numbers stabilize to mean \(\mu_{X^{(\infty)}} = p\) and variance \(\sigma_{X^{(\infty)}}^2 = p\)! What this indicates is that there may be a well-defined random variable we can define in this infinite subdivision limit.

This should not come as too much of a surprise, since in the real world we can just count the number of bus arrivals, however it is nice to see that our mathematical model is well defined. This result is known as the law of rare events.

Following through this reasoning carefully, we can arrive at the following model. We will say that \(X \sim \mathrm{Poisson}(\lambda)\) if it is a random variable which takes the values \(\{0,1,2,\ldots\}\) with probability

(16.8.13)\[p_k = \frac{\lambda^ke^{-\lambda}}{k!}.\]

The value \(\lambda > 0\) is known as the rate, and denotes the average number of arrivals we expect in one unit of time (note that we above restricted our rate to be at most one, but that was only to simplify the explanation).
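
We can watch the law of rare events numerically: as \(n\) grows, the \(\mathrm{Binomial}(n, \lambda/n)\) probability of seeing exactly \(k\) arrivals approaches \(\lambda^k e^{-\lambda}/k!\). The sketch below reuses the binom helper defined in the binomial section; the rate and \(k\) chosen are purely illustrative.

# Binomial(m, rate/m) probability of exactly k arrivals approaches the Poisson value
rate, k = 1.0, 2
for m in [10, 100, 1000, 10000]:
    print(m, binom(m, k) * (rate / m)**k * (1 - rate / m)**(m - k))
print('Poisson limit:', rate**k * np.exp(-rate) / factorial(k))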

We may sum this probability mass function to get the cumulative distribution function.

(16.8.14)\[\begin{split}F(x) = \begin{cases} 0 & x < 0, \\ e^{-\lambda}\sum_{m = 0}^k \frac{\lambda^m}{m!} & k \le x < k+1 \text{ with } 0 \le k. \end{cases}\end{split}\]

Let us first plot the probability mass function (16.8.13).

lam = 5.0

xs = [i for i in range(20)]
pmf = np.array([np.exp(-lam) * lam**k / factorial(k) for k in xs])

d2l.plt.stem(xs, pmf, use_line_collection=True)
d2l.plt.xlabel('x')
d2l.plt.ylabel('p.m.f.')
d2l.plt.show()

Now, let us plot the cumulative distribution function (16.8.14).

x = np.arange(-1, 21, 0.01)
cmf = np.cumsum(pmf)
def F(x):
    return 0 if x < 0 else 1 if x >= len(cmf) else cmf[int(x)]

d2l.plot(x, np.array([F(y) for y in x.tolist()]), 'x', 'c.d.f.')

As we saw above, the means and variances are particularly concise. If \(X \sim \mathrm{Poisson}(\lambda)\), then:

  • \(\mu_X = \lambda\),

  • \(\sigma_X^2 = \lambda\).

This can be sampled as follows.

np.random.poisson(lam, size=(10, 10))
array([[ 6,  5,  8,  3,  8,  1,  6,  5,  4,  2],
       [ 3,  6,  6,  2,  1,  6,  3,  7,  4,  5],
       [ 5,  7,  6,  4,  8,  4,  5,  9,  2,  4],
       [ 6,  5,  5,  5,  2,  5,  6,  5,  1,  7],
       [ 7,  5,  5, 11,  6,  8,  2,  9,  7,  4],
       [ 8,  5,  3,  3,  7,  6,  6,  2,  4,  7],
       [ 3,  7,  8,  1,  3,  6,  1,  1,  4,  7],
       [ 5,  6,  4,  2,  3,  2,  3, 12,  5,  6],
       [ 3,  1,  5,  5,  6,  6,  4,  6,  4,  5],
       [ 5,  4,  9,  7,  7,  4,  7, 10, 10,  4]])
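
A distinctive feature of the Poisson distribution is that its mean and variance coincide. As a quick empirical sketch (again assuming a sample of 100000 draws is large enough), both should be close to \(\lambda = 5\).

# For a Poisson sample, the mean and the variance should both be close to lam
samples = np.random.poisson(lam, 100000)
print(samples.mean(), samples.var(), lam)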

16.8.6. Gaussian

Now let us try a different, but related experiment. Let us say we again are performing \(n\) independent \(\mathrm{Bernoulli}(p)\) measurements \(X_i\). The distribution of the sum of these is \(X^{(n)} \sim \mathrm{Binomial}(n,p)\). Rather than taking a limit as \(n\) increases and \(p\) decreases, let us fix \(p\), and then send \(n \rightarrow \infty\). In this case \(\mu_{X^{(n)}} = np \rightarrow \infty\) and \(\sigma_{X^{(n)}}^2 = np(1-p) \rightarrow \infty\), so there is no reason to think this limit should be well defined.

However, not all hope is lost! Let us just make the mean and variance be well behaved by defining

(16.8.15)\[Y^{(n)} = \frac{X^{(n)} - \mu_{X^{(n)}}}{\sigma_{X^{(n)}}}.\]

This can be seen to have mean zero and variance one, and so it is plausible to believe that it will converge to some limiting distribution.

p = 0.2
ns = [1, 10, 100, 1000]
d2l.plt.figure(figsize=(10, 3))
for i in range(4):
    n = ns[i]
    pmf = np.array([p**i * (1-p)**(n-i) * binom(n, i) for i in range(n + 1)])
    d2l.plt.subplot(1, 4, i + 1)
    d2l.plt.stem([(i - n*p)/np.sqrt(n*p*(1 - p)) for i in range(n + 1)], pmf,
                 use_line_collection=True)
    d2l.plt.xlim([-4, 4])
    d2l.plt.xlabel('x')
    d2l.plt.ylabel('p.m.f.')
    d2l.plt.title("n = {}".format(n))
d2l.plt.show()

One thing to note: compared to the Poisson case, we are now dividing by the standard deviation, which means that we are squeezing the possible outcomes into smaller and smaller areas. This is an indication that our limit will no longer be discrete, but rather a continuous distribution.

A derivation of what occurs is well beyond the scope of this document, but the central limit theorem states that as \(n \rightarrow \infty\), this will yield the Gaussian distribution (sometimes called the normal distribution). More explicitly, for any \(a,b\):

(16.8.16)\[\lim_{n \rightarrow \infty} P(Y^{(n)} \in [a,b]) = P(\mathcal{N}(0,1) \in [a,b]),\]

where we say a random variable is normally distributed with given mean \(\mu\) and variance \(\sigma^2\), written \(X \sim \mathcal{N}(\mu,\sigma^2)\) if \(X\) has density

(16.8.17)\[p_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}.\]

Let us first plot the probability density function (16.8.17).

mu, sigma = 0, 1

x = np.arange(-3, 3, 0.01)
p = 1 / np.sqrt(2 * np.pi * sigma**2) * np.exp(-(x - mu)**2 / (2 * sigma**2))

d2l.plot(x, p, 'x', 'p.d.f.')

Now, let us plot the cumulative distribution function. Although the proof is beyond the scope of this appendix, the Gaussian c.d.f. does not have a closed-form formula in terms of more elementary functions. We will use erf, which provides a way to compute this integral numerically.

def phi(x):
    return (1.0 + erf((x - mu) / (sigma * np.sqrt(2)))) / 2.0

d2l.plot(x, np.array([phi(y) for y in x.tolist()]), 'x', 'c.d.f.')

Keen-eyed readers will recognize some of these terms. Indeed, this is the integral we encountered in Section 16.5, and we need exactly that computation to see that this \(p_X(x)\) has total area one and is thus a valid density.

Our choice of working with coin flips made computations shorter, but nothing about that choice was fundamental. Indeed, if we take any collection of independent identically distributed random variables \(X_i\), and form

(16.8.18)\[X^{(N)} = \sum_{i=1}^N X_i.\]

then

(16.8.19)\[\frac{X^{(N)} - \mu_{X^{(N)}}}{\sigma_{X^{(N)}}},\]

will be approximately Gaussian.
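
As a quick numerical sketch of this claim, we can sum \(N\) i.i.d. \(\mathrm{Uniform}([0,1])\) variables (mean \(N/2\), variance \(N/12\)), standardize as in (16.8.19), and compare the empirical probability of landing in \([-1,1]\) with the Gaussian value \(\Phi(1)-\Phi(-1)\approx 0.683\); the sample size and \(N\) below are illustrative choices.

# Standardized sums of i.i.d. Uniform([0, 1]) variables behave like a standard Gaussian
N = 50
sums = np.random.rand(100000, N).sum(axis=1)
Y = (sums - N * 0.5) / np.sqrt(N / 12)
print(np.mean((Y > -1) & (Y < 1)))  # empirical P(Y in [-1, 1])
print(phi(1) - phi(-1))             # Gaussian prediction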

This is the reason that the Gaussian is so central to probability, statistics, and machine learning. Whenever we can say that something we measured is a sum of many small independent contributions, we can safely assume that the thing being measured will be close to Gaussian. There are additional technical requirements needed to make it work (the most commonly shown proof needs \(E[X^4] < \infty\) for the independent random variables being added), but the philosophy is clear.

There are many more fascinating properties of Gaussians than we can get into at this point. In particular, the Gaussian is what is known as a maximum entropy distribution. We will get into entropy more deeply in Section 16.11; however, all we need to know at this point is that it is a measure of randomness. In a rigorous mathematical sense, we can think of the Gaussian as the most random choice of random variable with fixed mean and variance. Thus, if we know that our random variable has some mean and variance, the Gaussian is in a sense the most conservative choice of distribution we can make.

To close the section, let us recall that if \(X \sim \mathcal{N}(\mu,\sigma^2)\), then:

  • \(\mu_X = \mu\),

  • \(\sigma_X^2 = \sigma^2\).

We can sample from the Gaussian (or normal) as shown below.

np.random.normal(mu, sigma, size=(10, 10))
array([[-0.63949389, -0.24530997, -0.56666164, -0.08419307, -0.09554704,
        -0.30549514,  0.28729854, -1.25766955,  1.80486434,  0.26919607],
       [-0.77918539,  0.26659797,  2.39473801,  1.07936504,  0.38008657,
        -0.13701216, -0.63000631, -1.11032322,  0.89560036,  0.98752602],
       [-0.30407789,  2.07066265,  0.66744322,  0.01102993, -0.49496034,
        -1.32982284, -0.61714529, -0.61520507,  1.0985294 , -1.27619895],
       [ 1.55121261,  0.46784465,  0.25567573,  1.09332036, -0.85372054,
         0.34656911, -0.62870833, -1.18090726,  1.11274807,  0.47866461],
       [ 2.19873732,  0.49866639, -0.80075339,  0.47676037,  0.67889808,
         1.26713188,  0.37395834,  0.51869205,  1.47504046, -0.15219646],
       [-0.11635075,  0.22124654, -0.2526788 , -0.02395057, -0.70456345,
         0.22705479,  0.27395872,  1.07798873,  0.60349109,  0.72450661],
       [ 0.50675147, -0.5351561 ,  0.10298418, -1.15521094, -1.6345498 ,
        -0.88197322, -1.41595558, -0.59236168,  0.28063592, -0.14712107],
       [ 2.60484785, -1.95905532,  3.36660648,  1.26827329, -0.76688952,
         1.5877968 , -0.05067751,  1.15427934,  1.35915427,  1.16728218],
       [ 0.08212443,  1.46583006,  2.30420102,  0.04831307,  1.77889612,
         0.49091422, -0.59287109, -1.05227478,  1.79313453, -0.45552771],
       [ 1.8389271 ,  0.72060001,  1.19868888,  0.2106386 , -1.74313691,
        -0.29237516, -0.57248219, -0.60044081, -1.29016016, -0.23095353]])
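
Equivalently, since an affine transformation of a Gaussian random variable is again Gaussian, we could draw standard normals with np.random.randn and rescale them ourselves; a small sketch of the equivalent call.

# Equivalent: scale and shift standard normal draws
mu + sigma * np.random.randn(10, 10)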

16.8.7. Summary

  • Bernoulli random variables can be used to model events with a yes/no outcome.

  • Discrete uniform distributions model selections from a finite set of possibilities.

  • Continuous uniform distributions select from an interval.

  • Binomial distributions model a series of Bernoulli random variables, and count the number of successes.

  • Poisson random variables model the arrival of rare events.

  • Gaussian random variables model the results of adding a large number of independent random variables together.

16.8.8. Exercises

  1. What is the standard deviation of a random variable that is the difference \(X-Y\) of two independent binomial random variables \(X,Y \sim \mathrm{Binomial}(16,1/2)\)?

  2. If we take a Poisson random variable \(X \sim \mathrm{Poisson}(\lambda)\) and consider \((X - \lambda)/\sqrt{\lambda}\) as \(\lambda \rightarrow \infty\), we can show that this becomes approximately Gaussian. Why does this make sense?

  3. What is the probability mass function for a sum of two discrete uniform random variables on \(n\) elements?
