{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Multilayer Perceptron\n",
"\n",
"In the previous chapters, we showed how you could implement multiclass logistic regression (also called softmax regression) \n",
"for classifying images of clothing into the 10 possible categories.\n",
"To get there, we had to learn how to wrangle data, \n",
"coerce our outputs into a valid probability distribution (via `softmax`), \n",
"how to apply an appropriate loss function, \n",
"and how to optimize over our parameters. \n",
"Now that we’ve covered these preliminaries, \n",
"we are free to focus our attention on \n",
"the more exciting enterprise of designing powerful models\n",
"using deep neural networks.\n",
"\n",
"## Hidden Layers\n",
"\n",
"Recall that for linear regression and softmax regression,\n",
"we mapped our inputs directly to our outputs \n",
"via a single linear transformation:\n",
"\n",
"$$\n",
"\\hat{\\mathbf{o}} = \\mathrm{softmax}(\\mathbf{W} \\mathbf{x} + \\mathbf{b})\n",
"$$\n",
"\n",
"![Single layer perceptron with 5 output units.](../img/singlelayer.svg)\n",
"\n",
"If our labels really were related to our input data \n",
"by an approximately linear function, then this approach would be perfect. \n",
"But linearity is a *strong assumption*. \n",
"Linearity implies that for whatever target value we are trying to predict,\n",
"increasing the value of each of our inputs \n",
"should either drive the value of the output up or drive it down, \n",
"irrespective of the value of the other inputs.\n",
"\n",
"Sometimes this makes sense! \n",
"Say we are trying to predict whether an individual \n",
"will or will not repay a loan.\n",
"We might reasonably imagine that all else being equal,\n",
"an applicant with a higher income\n",
"would be more likely to repay than one with a lower income. \n",
"In these cases, linear models might perform well,\n",
"and they might even be hard to beat. \n",
"\n",
"But what about classifying images in FashionMNIST? \n",
"Should increasing the intensity of the pixel at location (13,17)\n",
"always increase the likelihood that the image depicts a pocketbook?\n",
"That seems ridiculous because we all know \n",
"that you cannot make sense out of an image \n",
"without accounting for the interactions among pixels.\n",
"\n",
"\n",
"\n",
"### From one to many\n",
"\n",
"As another case, consider trying to classify \n",
"black-and-white images based on whether they depict *cats* or *dogs*. \n",
"\n",
"If we use a linear model, we'd basically be saying that\n",
"for each pixel, increasing its value (making it more white) \n",
"must always increase the probability that the image depicts a dog \n",
"or must always increase the probability that the image depicts a cat. \n",
"We would be making the absurd assumption that the only requirement \n",
"for differentiating cats vs. dogs is to assess how bright they are. \n",
"That approach is doomed to fail in a world \n",
"that contains both black dogs and black cats, \n",
"and both white dogs and white cats.\n",
"\n",
"Teasing out what is depicted in an image generally requires \n",
"allowing more complex relationships between our inputs and outputs.\n",
"Thus we need models capable of discovering patterns \n",
"that might be characterized by interactions among the many features. \n",
"We can overcome these limitations of linear models\n",
"and handle a more general class of functions \n",
"by incorporating one or more hidden layers. \n",
"The easiest way to do this is to stack \n",
"many layers of neurons on top of each other. \n",
"Each layer feeds into the layer above it, until we generate an output. \n",
"This architecture is commonly called a *multilayer perceptron*,\n",
"often abbreviated as *MLP*. \n",
"The neural network diagram for an MLP looks like this:\n",
"\n",
"![Multilayer perceptron with hidden layers. This example contains a hidden layer with 5 hidden units in it. ](../img/mlp.svg)\n",
"\n",
"The multilayer perceptron above has 4 inputs and 3 outputs, \n",
"and the hidden layer in the middle contains 5 hidden units. \n",
"Since the input layer does not involve any calculations, \n",
"building this network would consist of \n",
"implementing 2 layers of computation. \n",
"The inputs are fully connected \n",
"to the neurons in the hidden layer. \n",
"Likewise, the neurons in the hidden layer \n",
"are fully connected to the neurons in the output layer.\n",
"\n",
"\n",
"### From linear to nonlinear\n",
"\n",
"We can write out the calculations that define this one-hidden-layer MLP in mathematical notation as follows:\n",
"$$\n",
"\\begin{aligned}\n",
" \\mathbf{h} & = \\mathbf{W}_1 \\mathbf{x} + \\mathbf{b}_1 \\\\\n",
" \\mathbf{o} & = \\mathbf{W}_2 \\mathbf{h} + \\mathbf{b}_2 \\\\\n",
" \\hat{\\mathbf{y}} & = \\mathrm{softmax}(\\mathbf{o})\n",
"\\end{aligned}\n",
"$$\n",
"\n",
"By adding another layer, we have added two new sets of parameters,\n",
"but what have we gained in exchange?\n",
"In the model defined above, we do not achieve anything for our troubles!\n",
"\n",
"That's because our hidden units are just a linear function of the inputs\n",
"and the outputs (pre-softmax) are just a linear function of the hidden units.\n",
"A linear function of a linear function is itself a linear function.\n",
"That means that for any values of the weights,\n",
"we could just collapse out the hidden layer \n",
"yielding an equivalent single-layer model using \n",
"$\\mathbf{W} = \\mathbf{W}_2 \\mathbf{W}_1$ and $\\mathbf{b} = \\mathbf{W}_2 \\mathbf{b}_1 + \\mathbf{b}_2$.\n",
"\n",
"$$\\mathbf{o} = \\mathbf{W}_2 \\mathbf{h} + \\mathbf{b}_2 = \\mathbf{W}_2 (\\mathbf{W}_1 \\mathbf{x} + \\mathbf{b}_1) + \\mathbf{b}_2 = (\\mathbf{W}_2 \\mathbf{W}_1) \\mathbf{x} + (\\mathbf{W}_2 \\mathbf{b}_1 + \\mathbf{b}_2) = \\mathbf{W} \\mathbf{x} + \\mathbf{b}$$\n",
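"\n",
"This collapse is easy to verify numerically. The following sketch (plain NumPy rather than NDArray, purely for illustration, with made-up layer sizes) checks that the stacked and collapsed forms agree:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(0)\n",
"# Hypothetical sizes: 4 inputs, 5 hidden units, 3 outputs\n",
"W1, b1 = rng.normal(size=(5, 4)), rng.normal(size=5)\n",
"W2, b2 = rng.normal(size=(3, 5)), rng.normal(size=3)\n",
"x = rng.normal(size=4)\n",
"\n",
"# Two stacked linear layers...\n",
"o_two_layer = W2 @ (W1 @ x + b1) + b2\n",
"# ...collapse into one layer with W = W2 W1 and b = W2 b1 + b2\n",
"W, b = W2 @ W1, W2 @ b1 + b2\n",
"o_one_layer = W @ x + b\n",
"\n",
"assert np.allclose(o_two_layer, o_one_layer)\n",
"```\n",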
"\n",
"In order to get a benefit from multilayer architectures,\n",
"we need another key ingredient: a nonlinearity $\\sigma$ applied to each of the hidden units after each layer's linear transformation. \n",
"The most popular choice for the nonlinearity these days is the rectified linear unit (ReLU) $\\mathrm{max}(x,0)$.\n",
"After incorporating these nonlinearities, \n",
"it becomes impossible to merge layers.\n",
"\n",
"$$\n",
"\\begin{aligned}\n",
" \\mathbf{h} & = \\sigma(\\mathbf{W}_1 \\mathbf{x} + \\mathbf{b}_1) \\\\\n",
" \\mathbf{o} & = \\mathbf{W}_2 \\mathbf{h} + \\mathbf{b}_2 \\\\\n",
" \\hat{\\mathbf{y}} & = \\mathrm{softmax}(\\mathbf{o})\n",
"\\end{aligned}\n",
"$$\n",
"\n",
"Clearly, we could continue stacking such hidden layers, \n",
"e.g. $\\mathbf{h}_1 = \\sigma(\\mathbf{W}_1 \\mathbf{x} + \\mathbf{b}_1)$ \n",
"and $\\mathbf{h}_2 = \\sigma(\\mathbf{W}_2 \\mathbf{h}_1 + \\mathbf{b}_2)$ \n",
"on top of each other to obtain a true multilayer perceptron.\n",
"\n",
"Multilayer perceptrons can account for complex interactions in the inputs \n",
"because the hidden neurons depend on the values of each of the inputs. \n",
"It’s easy to design a hidden node that does arbitrary computation, \n",
"such as, for instance, logical operations on its inputs. \n",
"Moreover, for certain choices of the activation function \n",
"it’s widely known that multilayer perceptrons are universal approximators. \n",
"That means that even for a single-hidden-layer neural network, \n",
"with enough nodes, and the right set of weights, \n",
"we can model any function at all! \n",
"*Actually learning that function is the hard part.* \n",
"\n",
"Moreover, just because a single-layer network *can* learn any function\n",
"doesn't mean that you should try to solve all of your problems with single-layer networks.\n",
"It turns out that we can approximate many functions \n",
"much more compactly if we use deeper (vs wider) neural networks. \n",
"We’ll get more into the math in a subsequent chapter, \n",
"but for now let’s actually build an MLP. \n",
"In this example, we’ll implement a multilayer perceptron \n",
"with two hidden layers and one output layer.\n",
"\n",
"### Vectorization and mini-batch\n",
"\n",
"As before, by the matrix $\\mathbf{X}$, we denote a mini-batch of inputs. \n",
"The calculations to produce outputs from an MLP with two hidden layers \n",
"can thus be expressed:\n",
"\n",
"$$\n",
"\\begin{aligned}\n",
" \\mathbf{H}_1 & = \\sigma(\\mathbf{W}_1 \\mathbf{X} + \\mathbf{b}_1) \\\\\n",
" \\mathbf{H}_2 & = \\sigma(\\mathbf{W}_2 \\mathbf{H}_1 + \\mathbf{b}_2) \\\\\n",
" \\mathbf{O} & = \\mathrm{softmax}(\\mathbf{W}_3 \\mathbf{H}_2 + \\mathbf{b}_3)\n",
"\\end{aligned}\n",
"$$\n",
"\n",
"With some abuse of notation, we define the nonlinearity $\\sigma$ \n",
"to apply to its inputs in a row-wise fashion, i.e. one observation at a time.\n",
"Note that we are also using the notation for *softmax* in the same way to denote a row-wise operation.\n",
"Often, as in this chapter, the activation functions that we apply to hidden layers are not merely row-wise, but component-wise.\n",
"That means that after computing the linear portion of the layer,\n",
"we can calculate each node's activation without looking at the values taken by the other hidden units. \n",
"This is true for most activation functions \n",
"(the [batch normalization](../chapter_convolutional-neural-networks/batch-norm.md) operation is a notable exception to that rule).\n",
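"\n",
"To make these shapes concrete, here is a minimal forward pass for the two-hidden-layer MLP above, sketched in plain NumPy with made-up layer sizes. Each column of $\\mathbf{X}$ holds one example, matching the $\\mathbf{W} \\mathbf{x}$ convention used in the equations:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(0)\n",
"\n",
"def relu(Z):\n",
"    # Component-wise max(Z, 0)\n",
"    return np.maximum(Z, 0)\n",
"\n",
"def softmax(O):\n",
"    # Subtract the column-wise max for numerical stability\n",
"    exp = np.exp(O - O.max(axis=0, keepdims=True))\n",
"    return exp / exp.sum(axis=0, keepdims=True)\n",
"\n",
"# Hypothetical sizes: 4 features, two hidden layers of width 5, 3 classes, batch of 2\n",
"X = rng.normal(size=(4, 2))\n",
"W1, b1 = rng.normal(size=(5, 4)), rng.normal(size=(5, 1))\n",
"W2, b2 = rng.normal(size=(5, 5)), rng.normal(size=(5, 1))\n",
"W3, b3 = rng.normal(size=(3, 5)), rng.normal(size=(3, 1))\n",
"\n",
"H1 = relu(W1 @ X + b1)\n",
"H2 = relu(W2 @ H1 + b2)\n",
"O = softmax(W3 @ H2 + b3)\n",
"\n",
"print(O.shape)        # (3, 2): one column of class probabilities per example\n",
"print(O.sum(axis=0))  # each column sums to 1\n",
"```\n",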
"\n",
"## Activation Functions\n",
"\n",
"Because they are so fundamental to deep learning, before going further, \n",
"let's take a brief look at some common activation functions. \n",
"\n",
"### ReLU Function\n",
"\n",
"As stated above, the most popular choice,\n",
"due to its simplicity of implementation \n",
"and its efficacy in training, is the rectified linear unit (ReLU).\n",
"ReLUs provide a very simple nonlinear transformation. \n",
"Given the element $z$, the function is defined \n",
"as the maximum of that element and 0. \n",
"\n",
"$$\\mathrm{ReLU}(z) = \\max(z, 0).$$\n",
"\n",
"The ReLU function retains only positive elements and discards negative elements (setting those nodes to 0). \n",
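"\n",
"The definition is easy to check on a few values. Here is a quick sketch in plain NumPy (rather than the NDArray operator), purely for illustration:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def relu(z):\n",
"    # Element-wise max(z, 0): keep positives, zero out negatives\n",
"    return np.maximum(z, 0)\n",
"\n",
"z = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])\n",
"print(relu(z))  # negative entries become 0; positive entries pass through\n",
"```\n",
"\n",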
"To get a better idea of what it looks like, we can plot it. \n",
"For convenience, we define a plotting function `xyplot` \n",
"to take care of the gruntwork."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "1"
}
},
"outputs": [],
"source": [
"import sys\n",
"sys.path.insert(0, '..')\n",
"\n",
"%matplotlib inline\n",
"import d2l\n",
"from mxnet import autograd, nd\n",
"\n",
"def xyplot(x_vals, y_vals, name):\n",
" d2l.set_figsize(figsize=(5, 2.5))\n",
" d2l.plt.plot(x_vals.asnumpy(), y_vals.asnumpy())\n",
" d2l.plt.xlabel('x')\n",
" d2l.plt.ylabel(name + '(x)')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because it is used so commonly, NDArray supports \n",
"the `relu` function as a basic native operator. \n",
"As you can see, the activation function is piece-wise linear."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "2"
}
},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
"