{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Convolutional Neural Networks (LeNet)\n",
"\n",
"We are now ready to put all of the tools together\n",
"to deploy your first fully-functional convolutional neural network.\n",
"In our first encounter with image data we applied a [Multilayer Perceptron](../chapter_deep-learning-basics/mlp-scratch.md) \n",
"to pictures of clothing in the Fashion-MNIST data set. \n",
"Each image in Fashion-MNIST consisted of \n",
"a two-dimensional $28 \\times 28$ matrix.\n",
"To make this data amenable to multilayer perceptrons\n",
"which anticapte receiving inputs as one-dimensional fixed-length vectors, \n",
"we first flattened each image, yielding vectors of length 784, \n",
"before processing them with a series of fully-connected layers. \n",
"\n",
"Now that we have introduced convolutional layers,\n",
"we can keep the image in its original spatially-organized grid,\n",
"processing it with a series of successive convolutional layers.\n",
"Moreover, because we are using convolutional layers,\n",
"we can enjoy a considerable savings in the number of parameters required.\n",
"\n",
"In this section, we will introduce one of the first\n",
"published convolutional neural networks\n",
"whose benefit was first demonstrated by Yann Lecun,\n",
"then a researcher at AT&T Bell Labs, \n",
"for the purpose of recognizing handwritten digits in imagesâ€”[LeNet5](http://yann.lecun.com/exdb/lenet/). \n",
"In the 90s, their experiments with LeNet gave the first compelling evidence \n",
"that it was possible to train convolutional neural networks \n",
"by backpropagation. \n",
"Their model achieved outstanding results at the time \n",
"(only matched by Support Vector Machines at the time)\n",
"and was adopted to recognize digits for processing deposits in ATM machines.\n",
"Some ATMs still runn the code \n",
"that Yann and his colleague Leon Bottou wrote in the 1990s!\n",
"\n",
"## LeNet\n",
"\n",
"In a rough sense, we can think LeNet as consisting of two parts: \n",
"(i) a block of convolutional layers; and \n",
"(ii) a block of fully-connected layers. \n",
"Before getting into the weeds, let's briefly review the model in pictures. \n",
"\n",
"![Data flow in LeNet 5. The input is a handwritten digit, the output a probabilitiy over 10 possible outcomes.](../img/lenet.svg)\n",
"\n",
"The basic units in the convolutional block are a convolutional layer \n",
"and a subsequent average pooling layer \n",
"(note that max-pooling works better, \n",
"but it had not been invented in the 90s yet). \n",
"The convolutional layer is used to recognize \n",
"the spatial patterns in the image, \n",
"such as lines and the parts of objects, \n",
"and the subsequent average pooling layer \n",
"is used to reduce the dimensionality. \n",
"The convolutional layer block is composed of \n",
"repeated stacks of these two basic units. \n",
"Each convolutional layer uses a $5\\times 5$ kernel \n",
"and processes each output with a sigmoid activation function\n",
"(again, note that ReLUs are now known to work more reliably, \n",
"but had not been invented yet). \n",
"The first convolutional layer has 6 output channels, \n",
"and second convolutional layer increases channel depth further to 16. \n",
"\n",
"However, coinciding with this increase in the number of channels,\n",
"the height and width are shrunk considerably. \n",
"Therefore, increasing the number of output channels \n",
"makes the parameter sizes of the two convolutional layers similar. \n",
"The two average pooling layers are of size $2\\times 2$ and take stride 2 \n",
"(note that this means they are non-overlapping). \n",
"In other words, the pooling layer downsamples the representation\n",
"to be precisely *one quarter* the pre-pooling size.\n",
"\n",
"The convolutional block emits an output with size given by\n",
"(batch size, channel, height, width). \n",
"Before we can pass the convolutional block's output\n",
"to the fully-connected block, we must flatten \n",
"each example in the mini-batch.\n",
"In other words, we take this 4D input and tansform it into the 2D \n",
"input expected by fully-connected layers:\n",
"as a reminder, the first dimension indexes the examples in the mini-batch\n",
"and the second gives the flat vector representation of each example.\n",
"LeNet's fully-connected layer block has three fully-connected layers,\n",
"with 120, 84, and 10 outputs, respectively. \n",
"Because we are still performing classification,\n",
"the 10 dimensional output layer corresponds \n",
"to the number of possible output classes.\n",
"\n",
"While getting to the point \n",
"where you truly understand \n",
"what's going on inside LeNet \n",
"may have taken a bit of work, \n",
"you can see below that implementing it \n",
"in a modern deep learning library \n",
"is remarkably simple. \n",
"Again, we'll rely on the Sequential class."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import sys\n",
"sys.path.insert(0, '..')\n",
"\n",
"import d2l\n",
"import mxnet as mx\n",
"from mxnet import autograd, gluon, init, nd\n",
"from mxnet.gluon import loss as gloss, nn\n",
"import time\n",
"\n",
"net = nn.Sequential()\n",
"net.add(nn.Conv2D(channels=6, kernel_size=5, padding=2, activation='sigmoid'),\n",
" nn.AvgPool2D(pool_size=2, strides=2),\n",
" nn.Conv2D(channels=16, kernel_size=5, activation='sigmoid'),\n",
" nn.AvgPool2D(pool_size=2, strides=2),\n",
" # Dense will transform the input of the shape (batch size, channel,\n",
" # height, width) into the input of the shape (batch size,\n",
" # channel * height * width) automatically by default\n",
" nn.Dense(120, activation='sigmoid'),\n",
" nn.Dense(84, activation='sigmoid'),\n",
" nn.Dense(10))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As compared to the original network,\n",
"we took the liberty of replacing \n",
"the Gaussian activation in the last layer \n",
"by a regular dense layer, which tends to be \n",
"significantly more convenient to train. \n",
"Other than that, this network matches \n",
"the historical definition of LeNet5. \n",
"Next, we feed a single-channel example \n",
"of size $28 \\times 28$ into the network \n",
"and perform a forward computation layer by layer \n",
"printing the output shape at each layer\n",
"to make sure we understand what's happening here."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"conv0 output shape:\t (1, 6, 28, 28)\n",
"pool0 output shape:\t (1, 6, 14, 14)\n",
"conv1 output shape:\t (1, 16, 10, 10)\n",
"pool1 output shape:\t (1, 16, 5, 5)\n",
"dense0 output shape:\t (1, 120)\n",
"dense1 output shape:\t (1, 84)\n",
"dense2 output shape:\t (1, 10)\n"
]
}
],
"source": [
"X = nd.random.uniform(shape=(1, 1, 28, 28))\n",
"net.initialize()\n",
"for layer in net:\n",
" X = layer(X)\n",
" print(layer.name, 'output shape:\\t', X.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that the height and width of the representation \n",
"at each layer throughout the convolutional block is reduced\n",
"(compared to the previous layer). \n",
"The convolutional layer uses a kernel \n",
"with a height and width of 5, \n",
"which with only $2$ pixels of padding in the first convolutional layer\n",
"and none in the second convolutional layer\n",
"leads to reductions in both height and width by 2 and 4 pixels, respectively.\n",
"Moreover each pooling layer halves the height and width. \n",
"However, as we go up the stack of layers,\n",
"the number of channels increases layer-over-layer \n",
"from 1 in the input to 6 after the first convolutional layer\n",
"and 16 after the second layer. \n",
"Then, the fully-connected layer reduces dimensionality layer by layer, \n",
"until emitting an output that matches the number of image classes.\n",
"\n",
"![Compressed notation for LeNet5](../img/lenet-vert.svg)\n",
"\n",
"\n",
"## Data Acquisition and Training\n",
"\n",
"Now that we've implemented the model,\n",
"we might as well run some experiments\n",
"to see what we can accomplish with the LeNet model. \n",
"While it might serve nostalgia\n",
"to train LeNet on the original MNIST OCR dataset,\n",
"that dataset has become too easy,\n",
"with MLPs getting over 98% accuracy,\n",
"so it would be hard to see the benefits of convolutional networks. \n",
"Thus we will stick with Fashion-MNIST as our dataset \n",
"because while it has the same shape ($28\\times28$ images),\n",
"this dataset is notably more challenging."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"batch_size = 256\n",
"train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size=batch_size)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"While convolutional networks may have few parameters,\n",
"they can still be significantly more expensive \n",
"to compute than a similarly deep multilayer perceptron\n",
"so if you have access to a GPU, this might be a good time \n",
"to put it into action to speed up training. \n",
"\n",
"Here's a simple function that we can use to detect whether we have a GPU.\n",
"In it, we try to allocate an NDArray on `gpu(0)`, \n",
"and use `gpu(0)` as our context if the operation proves successful. \n",
"Otherwise, we catch the resulting exception and we stick with the CPU."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"gpu(0)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# This function has been saved in the d2l package for future use\n",
"def try_gpu():\n",
" try:\n",
" ctx = mx.gpu()\n",
" _ = nd.zeros((1,), ctx=ctx)\n",
" except mx.base.MXNetError:\n",
" ctx = mx.cpu()\n",
" return ctx\n",
"\n",
"ctx = try_gpu()\n",
"ctx"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For evaluation, we need to make a slight modification \n",
"to the `evaluate_accuracy` function that we described \n",
"when [implementing the SoftMax from scratch](../chapter_deep-learning-basics/softmax-regression-scratch.md). \n",
"Since the full dataset lives on the CPU,\n",
"we need to copy it to the GPU before we can compute our models. \n",
"This is accomplished via the `as_in_context` function \n",
"described in the [GPU Computing](../chapter_deep-learning-computation/use-gpu.md) section. \n",
"Note that we accumulate the errors on the device \n",
"where the data eventually lives (in `acc`). \n",
"This avoids intermediate copy operations that might harm performance."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# This function has been saved in the d2l package for future use. The function\n",
"# will be gradually improved. Its complete implementation will be discussed in\n",
"# the \"Image Augmentation\" section\n",
"def evaluate_accuracy(data_iter, net, ctx):\n",
" acc_sum, n = nd.array([0], ctx=ctx), 0\n",
" for X, y in data_iter:\n",
" # If ctx is the GPU, copy the data to the GPU.\n",
" X, y = X.as_in_context(ctx), y.as_in_context(ctx).astype('float32')\n",
" acc_sum += (net(X).argmax(axis=1) == y).sum()\n",
" n += y.size\n",
" return acc_sum.asscalar() / n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We also need to update our training function to deal with GPUs. \n",
"Unlike [`train_ch3`](../chapter_deep-learning-basics/softmax-regression-scratch.md), we now need to move each batch of data\n",
"to our designated context (hopefully, the GPU) \n",
"prior to making the forward and backward passes."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# This function has been saved in the d2l package for future use\n",
"def train_ch5(net, train_iter, test_iter, batch_size, trainer, ctx,\n",
" num_epochs):\n",
" print('training on', ctx)\n",
" loss = gloss.SoftmaxCrossEntropyLoss()\n",
" for epoch in range(num_epochs):\n",
" train_l_sum, train_acc_sum, n, start = 0.0, 0.0, 0, time.time()\n",
" for X, y in train_iter:\n",
" X, y = X.as_in_context(ctx), y.as_in_context(ctx)\n",
" with autograd.record():\n",
" y_hat = net(X)\n",
" l = loss(y_hat, y).sum()\n",
" l.backward()\n",
" trainer.step(batch_size)\n",
" y = y.astype('float32')\n",
" train_l_sum += l.asscalar()\n",
" train_acc_sum += (y_hat.argmax(axis=1) == y).sum().asscalar()\n",
" n += y.size\n",
" test_acc = evaluate_accuracy(test_iter, net, ctx)\n",
" print('epoch %d, loss %.4f, train acc %.3f, test acc %.3f, '\n",
" 'time %.1f sec'\n",
" % (epoch + 1, train_l_sum / n, train_acc_sum / n, test_acc,\n",
" time.time() - start))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We initialize the model parameters on the device indicated by `ctx`, \n",
"this time using the Xavier initializer. \n",
"The loss function and the training algorithm\n",
"still use the cross-entropy loss function \n",
"and mini-batch stochastic gradient descent."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"training on gpu(0)\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch 1, loss 2.3188, train acc 0.101, test acc 0.100, time 1.9 sec\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch 2, loss 2.1576, train acc 0.174, test acc 0.511, time 1.7 sec\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch 3, loss 1.0287, train acc 0.581, test acc 0.695, time 1.6 sec\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch 4, loss 0.7813, train acc 0.691, test acc 0.741, time 1.7 sec\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch 5, loss 0.6654, train acc 0.739, test acc 0.760, time 1.7 sec\n"
]
}
],
"source": [
"lr, num_epochs = 0.9, 5\n",
"net.initialize(force_reinit=True, ctx=ctx, init=init.Xavier())\n",
"trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': lr})\n",
"train_ch5(net, train_iter, test_iter, batch_size, trainer, ctx, num_epochs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Summary\n",
"\n",
"* A convolutional neural network (in short, ConvNet) is a network using convolutional layers.\n",
"* In a ConvNet we alternate between convolutions, nonlinearities and often also pooling operations.\n",
"* Ultimately the resolution is reduced prior to emitting an output via one (or more) dense layers.\n",
"* LeNet was the first successful deployment of such a network.\n",
"\n",
"## Exercises\n",
"\n",
"1. Replace the average pooling with max pooling. What happens?\n",
"1. Try to construct a more complex network based on LeNet to improve its accuracy.\n",
" * Adjust the convolution window size.\n",
" * Adjust the number of output channels.\n",
" * Adjust the activation function (ReLU?).\n",
" * Adjust the number of convolution layers.\n",
" * Adjust the number of fully connected layers.\n",
" * Adjust the learning rates and other training details (initialization, epochs, etc.)\n",
"1. Try out the improved network on the original MNIST dataset.\n",
"1. Display the activations of the first and second layer of LeNet for different inputs (e.g. sweaters, coats).\n",
"\n",
"\n",
"## References\n",
"\n",
"[1] LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324.\n",
"\n",
"## Scan the QR Code to [Discuss](https://discuss.mxnet.io/t/2353)\n",
"\n",
"![](../img/qr_lenet.svg)"
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}