{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Implementation of Softmax Regression from Scratch\n",
"\n",
"Just as we implemented linear regression from scratch, \n",
"we believe that multiclass logistic (softmax) regression \n",
"is similarly fundamental and you ought to know \n",
"the gory details of how to implement it from scratch. \n",
"As with linear regression, after doing things by hand\n",
"we will breeze through an implementation in Gluon for comparison. \n",
"To begin, let's import our packages \n",
"(only `autograd`, `nd` are needed here \n",
"because we will be doing the heavy lifting ourselves.)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import sys\n",
"sys.path.insert(0, '..')\n",
"\n",
"%matplotlib inline\n",
"import d2l\n",
"from mxnet import autograd, nd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will work with the Fashion-MNIST dataset just introduced,\n",
"cuing up an iterator with batch size 256."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"batch_size = 256\n",
"train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initialize Model Parameters\n",
"\n",
"Just as in linear regression, we represent each example as a vector. \n",
"Since each example is a $28 \\times 28$ image,\n",
"we can flatten each example, treating them as $784$ dimensional vectors. \n",
"In the future, we'll talk about more sophisticated strategies \n",
"for exploiting the spatial structure in images, \n",
"but for now we treat each pixel location as just another feature.\n",
"\n",
"Recall that in softmax regression, \n",
"we have as many outputs as there are categories. \n",
"Because our dataset has $10$ categories, \n",
"our network will have an output dimension of $10$. \n",
"Consequently, our weights will constitute a $784 \\times 10$ matrix\n",
"and the biases will constitute a $1 \\times 10$ vector. \n",
"As with linear regression, we will initialize our weights $W$ \n",
"with Gaussian noise and our biases to take the initial value $0$."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "9"
}
},
"outputs": [],
"source": [
"num_inputs = 784\n",
"num_outputs = 10\n",
"\n",
"W = nd.random.normal(scale=0.01, shape=(num_inputs, num_outputs))\n",
"b = nd.zeros(num_outputs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Recall that we need to *attach gradients* to the model parameters. \n",
"More literally, we are allocating memory for future gradients to be stored\n",
"and notifiying MXNet that we want gradients to be calculated with respect to these parameters in the first place."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "10"
}
},
"outputs": [],
"source": [
"W.attach_grad()\n",
"b.attach_grad()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The Softmax\n",
"\n",
"Before implementing the softmax regression model, \n",
"let's briefly review how operators such as `sum` work \n",
"along specific dimensions in an NDArray. \n",
"Given a matrix `X` we can sum over all elements (default) or only \n",
"over elements in the same column (`axis=0`) or the same row (`axis=1`). \n",
"Note that if `X` is an array with shape `(2, 3)` \n",
"and we sum over the columns (`X.sum(axis=0`), \n",
"the result will be a (1D) vector with shape `(3,)`.\n",
"If we want to keep the number of axes in the original array\n",
"(resulting in a 2D array with shape `(1,3)`),\n",
"rather than collapsing out the dimension that we summed over\n",
"we can specify `keepdims=True` when invoking `sum`."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "11"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(\n",
" [[5. 7. 9.]]\n",
" , \n",
" [[ 6.]\n",
" [15.]]\n",
" )"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X = nd.array([[1, 2, 3], [4, 5, 6]])\n",
"X.sum(axis=0, keepdims=True), X.sum(axis=1, keepdims=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We are now ready to implement the softmax function. \n",
"Recall that softmax consists of two steps:\n",
"First, we exponentiate each term (using `exp`). \n",
"Then, we sum over each row (we have one row per example in the batch) \n",
"to get the normalization constants for each example. \n",
"Finally, we divide each row by its normalization constant,\n",
"ensuring that the result sums to $1$. \n",
"Before looking at the code, let's recall \n",
"what this looks expressed as an equation:\n",
"\n",
"$$\n",
"\\mathrm{softmax}(\\mathbf{X})_{ij} = \\frac{\\exp(X_{ij})}{\\sum_k \\exp(X_{ik})}\n",
"$$\n",
"\n",
"The denominator, or normalization constant,\n",
"is also sometimes called the partition function\n",
"(and its logarithm the log-partition function). \n",
"The origins of that name are in [statistical physics](https://en.wikipedia.org/wiki/Partition_function_(statistical_mechanics)) \n",
"where a related equation models the distribution \n",
"over an ensemble of particles)."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "12"
}
},
"outputs": [],
"source": [
"def softmax(X):\n",
" X_exp = X.exp()\n",
" partition = X_exp.sum(axis=1, keepdims=True)\n",
" return X_exp / partition # The broadcast mechanism is applied here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see, for any random input, we turn each element into a non-negative number. Moreover, each row sums up to 1, as is required for a probability.\n",
"Note that while this looks correct mathematically,\n",
"we were a bit sloppy in our implementation\n",
"because failed to take precautions against numerical overflow or underflow \n",
"due to large (or very small) elements of the matrix, \n",
"as we did in [Naive Bayes](../chapter_crashcourse/naive-bayes.md)."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "13"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(\n",
" [[0.21324193 0.33961776 0.1239742 0.27106097 0.05210521]\n",
" [0.11462264 0.3461234 0.19401033 0.29583326 0.04941036]]\n",
" , \n",
" [1.0000001 1. ]\n",
" )"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X = nd.random.normal(shape=(2, 5))\n",
"X_prob = softmax(X)\n",
"X_prob, X_prob.sum(axis=1)"
]
},
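{
"cell_type": "markdown",
"metadata": {},
"source": [
"One common precaution against the overflow mentioned above is to subtract each row's maximum before exponentiating. This leaves the result unchanged, because the common factor cancels between numerator and denominator. The sketch below is written in plain Python rather than MXNet so that it can be run in isolation; the name `stable_softmax` is our own and is not part of `d2l` or MXNet.\n",
"\n",
"```python\n",
"import math\n",
"\n",
"def stable_softmax(rows):\n",
"    # Row-wise softmax over a list of lists of floats\n",
"    result = []\n",
"    for row in rows:\n",
"        # Shifting by the row maximum makes the largest exponent 0,\n",
"        # so math.exp can no longer overflow\n",
"        shift = max(row)\n",
"        exps = [math.exp(v - shift) for v in row]\n",
"        partition = sum(exps)\n",
"        result.append([e / partition for e in exps])\n",
"    return result\n",
"\n",
"# math.exp(1000.0) on its own would raise OverflowError\n",
"probs = stable_softmax([[1000.0, 1001.0, 1002.0], [-1000.0, 0.0, 1000.0]])\n",
"print([sum(row) for row in probs])\n",
"```\n",
"\n",
"The first row equals the softmax of `[0, 1, 2]`, since only the differences between entries in a row matter."
]
},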
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The Model\n",
"\n",
"Now that we have defined the softmax operation, \n",
"we can implement the softmax regression model. \n",
"The below code defines the forward pass through the network.\n",
"Note that we flatten each original image in the batch \n",
"into a vector with length `num_inputs` with the `reshape` function\n",
"before passing the data through our model."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "14"
}
},
"outputs": [],
"source": [
"def net(X):\n",
" return softmax(nd.dot(X.reshape((-1, num_inputs)), W) + b)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The Loss Function\n",
"\n",
"Next, we need to implement the cross entropy loss function,\n",
"introduced in the [last section](softmax-regression.md).\n",
"This may be the most common loss function \n",
"in all of deep learning because, at the moment, \n",
"classification problems far outnumber regression problems.\n",
"\n",
"\n",
"Recall that cross entropy takes the negative log likelihood \n",
"of the predicted probability assigned to the true label $-\\log p(y|x)$. \n",
"Rather than iterating over the predictions with a Python `for` loop \n",
"(which tends to be inefficient), we can use the `pick` function\n",
"which allows us to select the appropriate terms \n",
"from the matrix of softmax entries easily. \n",
"Below, we illustrate the `pick` function on a toy example,\n",
"with 3 categories and 2 examples."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "15"
}
},
"outputs": [
{
"data": {
"text/plain": [
"\n",
"[0.1 0.5]\n",
""
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y_hat = nd.array([[0.1, 0.3, 0.6], [0.3, 0.2, 0.5]])\n",
"y = nd.array([0, 2], dtype='int32')\n",
"nd.pick(y_hat, y)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can implement the cross-entropy loss function efficiently\n",
"with just one line of code."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "16"
}
},
"outputs": [],
"source": [
"def cross_entropy(y_hat, y):\n",
" return - nd.pick(y_hat, y).log()"
]
},
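{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the indexing performed by `pick` concrete, here is a minimal plain-Python sketch that reproduces the toy example above with nested lists. The helpers `pick_rows` and `cross_entropy_rows` are illustrative names of our own, not MXNet functions.\n",
"\n",
"```python\n",
"import math\n",
"\n",
"def pick_rows(y_hat, y):\n",
"    # For each example, select the probability assigned to its true label\n",
"    return [row[label] for row, label in zip(y_hat, y)]\n",
"\n",
"def cross_entropy_rows(y_hat, y):\n",
"    # Negative log of the picked probability, one loss value per example\n",
"    return [-math.log(p) for p in pick_rows(y_hat, y)]\n",
"\n",
"y_hat = [[0.1, 0.3, 0.6], [0.3, 0.2, 0.5]]\n",
"y = [0, 2]\n",
"picked = pick_rows(y_hat, y)\n",
"losses = cross_entropy_rows(y_hat, y)\n",
"print(picked, losses)\n",
"```\n",
"\n",
"The picked values match the `[0.1 0.5]` returned by `nd.pick` above. The first example incurs the larger loss because its true class received a probability of only 0.1."
]
},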
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Classification Accuracy\n",
"\n",
"Given the predicted probability distribution `y_hat`, \n",
"we typically choose the class with highest predicted probability \n",
"whenever we must output a *hard* prediction. Indeed, many applications require that we make a choice. Gmail must catetegorize an email into Primary, Social, Updates, or Forums. It might estimate probabilities internally, but at the end of the day it has to choose one among the categories. \n",
"\n",
"When predictions are consistent with the actual category `y`, they are coorect. The classification accuracy is the fraction of all predictions that are correct. Although we cannot optimize accuracy directly (it is not differentiable), it's often the performance metric that we care most about, and we will nearly always report it when training classifiers.\n",
"\n",
"To compute accuracy we do the following: \n",
"First, we execute `y_hat.argmax(axis=1)` \n",
"to gather the predicted classes \n",
"(given by the indices for the largest entires each row). \n",
"The result has the same shape as the variable `y`. \n",
"Now we just need to check how frequently the two match. \n",
"Since the equality operator `==` is datatype-sensitive \n",
"(e.g. an `int` and a `float32` are never equal), \n",
"we also need to convert both to the same type (we pick `float32`). \n",
"The result is an NDArray containing entries of 0 (false) and 1 (true). \n",
"Taking the mean yields the desired result."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "17"
}
},
"outputs": [],
"source": [
"def accuracy(y_hat, y):\n",
" return (y_hat.argmax(axis=1) == y.astype('float32')).mean().asscalar()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will continue to use the variables `y_hat` and `y` \n",
"defined in the `pick` function, \n",
"as the predicted probability distribution and label, respectively. \n",
"We can see that the first example's prediction category is 2 \n",
"(the largest element of the row is 0.6 with an index of 2), \n",
"which is inconsistent with the actual label, 0. \n",
"The second example's prediction category is 2 \n",
"(the largest element of the row is 0.5 with an index of 2), \n",
"which is consistent with the actual label, 2. \n",
"Therefore, the classification accuracy rate for these two examples is 0.5."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "18"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0.5"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"accuracy(y_hat, y)"
]
},
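{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same computation can be spelled out without NDArray operations. The following plain-Python sketch is an illustration under our own naming (`argmax` and `accuracy_rows` are hypothetical helpers, not MXNet functions); it takes the arg-max of each row and counts how often it agrees with the label.\n",
"\n",
"```python\n",
"def argmax(row):\n",
"    # Index of the largest entry in a list\n",
"    return max(range(len(row)), key=lambda j: row[j])\n",
"\n",
"def accuracy_rows(y_hat, y):\n",
"    # Fraction of examples whose highest-probability class equals the label\n",
"    hits = [argmax(row) == label for row, label in zip(y_hat, y)]\n",
"    return sum(hits) / len(hits)\n",
"\n",
"y_hat = [[0.1, 0.3, 0.6], [0.3, 0.2, 0.5]]\n",
"y = [0, 2]\n",
"print(accuracy_rows(y_hat, y))\n",
"```\n",
"\n",
"Because the comparison here is between Python integers, no type conversion is needed, unlike in the NDArray version."
]
},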
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Similarly, we can evaluate the accuracy for model `net` on the data set\n",
"(accessed via `data_iter`)."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "19"
}
},
"outputs": [],
"source": [
"# The function will be gradually improved: the complete implementation will be\n",
"# discussed in the \"Image Augmentation\" section\n",
"def evaluate_accuracy(data_iter, net):\n",
" acc_sum, n = 0.0, 0\n",
" for X, y in data_iter:\n",
" y = y.astype('float32')\n",
" acc_sum += (net(X).argmax(axis=1) == y).sum().asscalar()\n",
" n += y.size\n",
" return acc_sum / n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because we initialized the `net` model with random weights, \n",
"the accuracy of this model should be close to random guessing, \n",
"i.e. 0.1 for 10 classes."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "20"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0.0925"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"evaluate_accuracy(test_iter, net)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Model Training\n",
"\n",
"The training loop for softmax regression should look strikingly familiar\n",
"if you read through our implementation \n",
"of linear regression earlier in this chapter. \n",
"Again, we use the mini-batch stochastic gradient descent \n",
"to optimize the loss function of the model. \n",
"Note that the number of epochs (`num_epochs`), \n",
"and learning rate (`lr`) are both adjustable hyper-parameters. \n",
"By changing their values, we may be able to increase the classification accuracy of the model. In practice we'll want to split our data three ways \n",
"into training, validation, and test data, using the validation data to choose the best values of our hyperparameters."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "21"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch 1, loss 0.7874, train acc 0.745, test acc 0.807\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch 2, loss 0.5726, train acc 0.812, test acc 0.823\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch 3, loss 0.5299, train acc 0.823, test acc 0.828\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch 4, loss 0.5057, train acc 0.830, test acc 0.837\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch 5, loss 0.4894, train acc 0.834, test acc 0.838\n"
]
}
],
"source": [
"num_epochs, lr = 5, 0.1\n",
"\n",
"# This function has been saved in the d2l package for future use\n",
"def train_ch3(net, train_iter, test_iter, loss, num_epochs, batch_size,\n",
" params=None, lr=None, trainer=None):\n",
" for epoch in range(num_epochs):\n",
" train_l_sum, train_acc_sum, n = 0.0, 0.0, 0\n",
" for X, y in train_iter:\n",
" with autograd.record():\n",
" y_hat = net(X)\n",
" l = loss(y_hat, y).sum()\n",
" l.backward()\n",
" if trainer is None:\n",
" d2l.sgd(params, lr, batch_size)\n",
" else:\n",
" # This will be illustrated in the next section\n",
" trainer.step(batch_size)\n",
" y = y.astype('float32')\n",
" train_l_sum += l.asscalar()\n",
" train_acc_sum += (y_hat.argmax(axis=1) == y).sum().asscalar()\n",
" n += y.size\n",
" test_acc = evaluate_accuracy(test_iter, net)\n",
" print('epoch %d, loss %.4f, train acc %.3f, test acc %.3f'\n",
" % (epoch + 1, train_l_sum / n, train_acc_sum / n, test_acc))\n",
"\n",
"train_ch3(net, train_iter, test_iter, cross_entropy, num_epochs,\n",
" batch_size, [W, b], lr)"
]
},
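{
"cell_type": "markdown",
"metadata": {},
"source": [
"The parameter update itself is hidden inside `d2l.sgd`. As a rough sketch, and assuming it performs the plain mini-batch SGD update from the linear regression section (its source is not reproduced here), each parameter moves against its gradient, scaled by the learning rate and divided by the batch size because the loss was summed over the batch. The function `sgd_step` and the toy numbers are hypothetical and use plain Python floats rather than NDArrays.\n",
"\n",
"```python\n",
"def sgd_step(params, grads, lr, batch_size):\n",
"    # Gradients come from a loss summed over the batch, so dividing by\n",
"    # batch_size turns them into per-example averages\n",
"    return [p - lr * g / batch_size for p, g in zip(params, grads)]\n",
"\n",
"params = [1.0, -2.0]\n",
"grads = [256.0, -512.0]\n",
"new_params = sgd_step(params, grads, lr=0.1, batch_size=256)\n",
"print(new_params)\n",
"```\n",
"\n",
"With these numbers the averaged gradients are 1 and -2, so the parameters move to roughly 0.9 and -1.8."
]
},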
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prediction\n",
"\n",
"Now that training is complete, our model is ready to classify some images. \n",
"Given a series of images, we will compare their actual labels \n",
"(first line of text output) and the model predictions \n",
"(second line of text output)."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
"