{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Naive Bayes Classification\n",
"\n",
"Before we worry about complex optimization algorithms or GPUs, we can already deploy our first classifier, relying only on simple statistical estimators and our understanding of conditional independence. Learning is all about making assumptions. If we want to classify a new data point that we've never seen before we have to make some assumptions about which data points are *similar* to each other. \n",
"\n",
"One popular (and remarkably simple) algorithm is the Naive Bayes Classifier. Note that one natural way to express the classification task is via the probabilistic question: *what is the most likely label given the features?* Formally, we wish to output the prediction $\\hat{y}$ given by the expression:\n",
"\n",
"$$\\hat{y} = \\text{argmax}_y \\> p(y | \\mathbf{x})$$\n",
"\n",
"Unfortunately, this requires that we estimate $p(y | \\mathbf{x})$ for every value of $\\mathbf{x} = x_1, ..., x_d$. Imagine that each feature could take one of $2$ values. For example, the feature $x_1 = 1$ might signify that the word `apple` appears in a given document and $x_1 = 0$ would signify that it does not. If we had $30$ such binary features, that would mean that we need to be prepared to classify any of $2^{30}$ (over 1 billion!) possible values of the input vector $\\mathbf{x}$. \n",
"\n",
"Moreover, where is the learning? If we need to see every single possible example in order to predict the corresponding label then we're not really learning a pattern but just memorizing the dataset. Fortunately, by making some assumptions about conditional independence, we can introduce some inductive bias and build a model capable of generalizing from a comparatively modest selection of training examples. \n",
"\n",
"To begin, let's use Bayes' theorem to express the classifier as \n",
"\n",
"$$\\hat{y} = \\text{argmax}_y \\> \\frac{p( \\mathbf{x} | y) p(y)}{p(\\mathbf{x})}$$\n",
"\n",
"Note that the denominator is the normalizing term $p(\\mathbf{x})$, which does not depend on the value of the label $y$. As a result, we only need to worry about comparing the numerator across different values of $y$. Even if calculating the denominator turned out to be intractable, we could get away with ignoring it, so long as we could evaluate the numerator. And should we ever want the normalizing constant, we could recover it, since we know that $\\sum_y p(y | \\mathbf{x}) = 1$.\n",
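"\n",
"As a minimal sketch of this normalization trick (with made-up unnormalized scores, not values from any real model), dividing the numerators by their sum recovers $p(y | \\mathbf{x})$ without ever modeling $p(\\mathbf{x})$ directly:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Hypothetical unnormalized numerators p(x|y) * p(y) for three labels\n",
"scores = np.array([0.002, 0.006, 0.012])\n",
"posterior = scores / scores.sum()  # p(x) is just the sum of the numerators\n",
"```\n",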
"Now, using the chain rule of probability, we can express the term $p( \\mathbf{x} | y)$ as\n",
"\n",
"$$p(x_1 |y) \\cdot p(x_2 | x_1, y) \\cdot ... \\cdot p( x_d | x_1, ..., x_{d-1}, y)$$\n",
"\n",
"By itself, this expression doesn't get us any further. We still must estimate roughly $2^d$ parameters. However, if we assume that ***the features are conditionally independent of each other, given the label***, then suddenly we're in much better shape, as this term simplifies to $\\prod_i p(x_i | y)$, giving us the predictor\n",
"\n",
"$$ \\hat{y} = \\text{argmax}_y \\> \\prod_i p(x_i | y) p(y)$$\n",
"\n",
"Estimating each term $p(x_i | y)$ amounts to estimating just one parameter (per feature and label). So our assumption of conditional independence has taken the complexity of our model (in terms of the number of parameters) from an exponential dependence on the number of features to a linear one. Moreover, we can now make predictions for examples that we've never seen before, because we just need the estimates $p(x_i | y)$, and each of those can be estimated from many different documents.\n",
"\n",
"Let's take a closer look at the key assumption that the attributes are all independent of each other, given the labels, i.e., $p(\\mathbf{x} | y) = \\prod_i p(x_i | y)$. Consider classifying emails into spam and ham. It's fair to say that the occurrence of the words `Nigeria`, `prince`, `money`, `rich` are all likely indicators that the e-mail might be spam, whereas `theorem`, `network`, `Bayes` or `statistics` are good indicators that the exchange is less likely to be part of an orchestrated attempt to wheedle out your bank account numbers. Thus, we could model the probability of occurrence for each of these words, given the respective class, and then use it to score the likelihood of a text. In fact, for a long time this *was* precisely how many so-called [Bayesian spam filters](https://en.wikipedia.org/wiki/Naive_Bayes_spam_filtering) worked.\n",
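"\n",
"As a toy illustration (with made-up word probabilities and class priors, not estimates from any real corpus), scoring a short text under each class just multiplies the per-word likelihoods with the class prior:\n",
"\n",
"```python\n",
"# Hypothetical conditional probabilities p(word | class) and class priors\n",
"p_word = {'spam': {'prince': 0.05, 'money': 0.10, 'theorem': 0.001},\n",
"          'ham':  {'prince': 0.001, 'money': 0.01, 'theorem': 0.05}}\n",
"prior = {'spam': 0.4, 'ham': 0.6}\n",
"\n",
"def score(words, label):\n",
"    s = prior[label]\n",
"    for w in words:\n",
"        s *= p_word[label][w]  # conditional independence: plain product\n",
"    return s\n",
"\n",
"prediction = max(['spam', 'ham'], key=lambda y: score(['prince', 'money'], y))\n",
"```\n",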
"\n",
"\n",
"## Optical Character Recognition\n",
"\n",
"Since images are much easier to deal with than raw text, we will illustrate the workings of a Naive Bayes classifier by distinguishing digits on the MNIST dataset. The problem is that we don't actually know $p(y)$ and $p(x_i | y)$, so we need to *estimate* them from training data first. This is what is called *training* the model. Estimating $p(y)$ is not too hard: since we are only dealing with 10 classes, we simply count the number of occurrences $n_y$ of each digit and divide by the total number of examples $n$. For instance, if digit 8 occurs $n_8 = 5,800$ times and we have a total of $n = 60,000$ images, the probability estimate is $p(y=8) \\approx 0.0967$.\n",
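"\n",
"This counting can be sketched in a few lines (with a toy label vector standing in for the real MNIST labels):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"labels = np.array([8, 3, 8, 1, 8])       # toy stand-in for the MNIST labels\n",
"n_y = np.bincount(labels, minlength=10)  # occurrences of each digit\n",
"p_y = n_y / n_y.sum()                    # here p(y=8) = 3/5\n",
"```\n",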
"\n",
"Now on to the slightly more difficult part: $p(x_i | y)$. Since we converted the images to black and white, $p(x_i | y)$ denotes the probability that pixel $i$ is switched on for class $y$. Just like before, we can count the number of times $n_{iy}$ that this event occurs and divide it by the total number of occurrences of $y$, i.e. $n_y$. But there's something slightly troubling: certain pixels may never be switched on in the training data (e.g. for very well cropped images the corner pixels might always be off). A convenient way for statisticians to deal with this problem is to add pseudo-counts to all occurrences. Hence, rather than $n_{iy}$ we use $n_{iy}+1$, and instead of $n_y$ we use $n_{y} + 1$. This is also called [Laplace Smoothing](https://en.wikipedia.org/wiki/Additive_smoothing)."
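,
"\n",
"A tiny numeric sketch of why the pseudo-counts matter (toy counts, not MNIST statistics): if pixel $i$ was never on among the $n_y$ images of some class, the raw estimate is exactly zero and a single such pixel vetoes the whole class, while the smoothed estimate stays small but positive:\n",
"\n",
"```python\n",
"n_iy, n_y = 0, 100                 # toy counts: pixel never on for this class\n",
"raw = n_iy / n_y                   # 0.0, which zeroes the whole product\n",
"smoothed = (n_iy + 1) / (n_y + 1)  # Laplace smoothing keeps it positive\n",
"```\n"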
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "1"
}
},
"outputs": [],
"source": [
"%matplotlib inline\n",
"from matplotlib import pyplot as plt\n",
"from IPython import display\n",
"display.set_matplotlib_formats('svg')\n",
"import mxnet as mx\n",
"from mxnet import nd\n",
"import numpy as np\n",
"\n",
"# We go over one observation at a time (speed doesn't matter here)\n",
"def transform(data, label):\n",
" return (nd.floor(data/128)).astype(np.float32), label.astype(np.float32)\n",
"mnist_train = mx.gluon.data.vision.MNIST(train=True, transform=transform)\n",
"mnist_test = mx.gluon.data.vision.MNIST(train=False, transform=transform)\n",
"\n",
"# Initialize the counters\n",
"xcount = nd.ones((784,10))\n",
"ycount = nd.ones((10))\n",
"\n",
"for data, label in mnist_train:\n",
" y = int(label)\n",
" ycount[y] += 1\n",
" xcount[:,y] += data.reshape((784))\n",
"\n",
"# Normalize the counts, using broadcasting for the division\n",
"py = ycount / ycount.sum()\n",
"px = (xcount / ycount.reshape(1,10))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have computed per-pixel counts of occurrence for all pixels and classes, it's time to see how our model behaves, so let's plot it. This is where it is so much more convenient to work with images: visualizing $28 \\times 28 \\times 10$ probabilities (one per pixel, per class) as raw numbers would typically be an exercise in futility, but by plotting them as images we get a quick overview. The astute reader probably noticed by now that these are some mean looking digits ..."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "2"
}
},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
"