{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Multiple Input and Output Channels\n",
"\n",
"While we have described the multiple channels\n",
"that comprise each image (e.g. color images have the standard RGB channels \n",
"to indicate the amount of red, green and blue),\n",
"until now, we simplified all of our numerical examples \n",
"by working with just a single input and a single output channel.\n",
"This has allowed us to think of our inputs, convolutional kernels,\n",
"and outputs each as two-dimensional arrays.\n",
"\n",
"When we add channels into the mix, \n",
"our inputs and hidden representations \n",
"both become three-dimensional arrays.\n",
"For example, each RGB input image has shape $3\\times h\\times w$. \n",
"We refer to this axis, with a size of 3, as the channel dimension. \n",
"In this section, we will take a deeper look \n",
"at convolution kernels with multiple input and multiple output channels.\n",
"\n",
"## Multiple Input Channels\n",
"\n",
"When the input data contains multiple channels, \n",
"we need to construct a convolution kernel \n",
"with the same number of input channels as the input data, \n",
"so that it can perform cross-correlation with the input data. \n",
"Assuming that the number of channels for the input data is $c_i$, \n",
"the number of input channels of the convolution kernel also needs to be $c_i$. If our convolution kernel's window shape is $k_h\\times k_w$,\n",
"then when $c_i=1$, we can think of our convolution kernel \n",
"as just a two-dimensional array of shape $k_h\\times k_w$. \n",
"\n",
"However, when $c_i>1$, we need a kernel \n",
"that contains an array of shape $k_h\\times k_w$ *for each input channel*. Concatenating these $c_i$ arrays together \n",
"yields a convolution kernel of shape $c_i\\times k_h\\times k_w$. \n",
"Since the input and convolution kernel each have $c_i$ channels, \n",
"we can perform a cross-correlation operation \n",
"on the two-dimensional array of the input \n",
"and the two-dimensional kernel array of the convolution kernel \n",
"for each channel, adding the $c_i$ results together \n",
"(summing over the channels)\n",
"to yield a two-dimensional array. \n",
"This is the result of a two-dimensional cross-correlation \n",
"between multi-channel input data and \n",
"a *multi-input channel* convolution kernel.\n",
"\n",
"In the figure below, we demonstrate an example \n",
"of a two-dimensional cross-correlation with two input channels. \n",
"The shaded portions are the first output element \n",
"as well as the input and kernel array elements used in its computation: \n",
"$(1\\times1+2\\times2+4\\times3+5\\times4)+(0\\times0+1\\times1+3\\times2+4\\times3)=56$.\n",
"\n",
"![Cross-correlation computation with 2 input channels. The shaded portions are the first output element as well as the input and kernel array elements used in its computation: $(1\\times1+2\\times2+4\\times3+5\\times4)+(0\\times0+1\\times1+3\\times2+4\\times3)=56$. ](../img/conv_multi_in.svg)\n",
"\n",
"\n",
"To make sure we reall understand what's going on here,\n",
"we can implement cross-correlation operations with multiple input channels ourselves. \n",
"Notice that all we are doing is performing one cross-correlation operation \n",
"per channel and then adding up the results using the `add_n` function."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "1"
}
},
"outputs": [],
"source": [
"import sys\n",
"sys.path.insert(0, '..')\n",
"\n",
"import d2l\n",
"from mxnet import nd\n",
"\n",
"def corr2d_multi_in(X, K):\n",
" # First, traverse along the 0th dimension (channel dimension) of X and K.\n",
" # Then, add them together by using * to turn the result list into a\n",
" # positional argument of the add_n function\n",
" return nd.add_n(*[d2l.corr2d(x, k) for x, k in zip(X, K)])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can construct the input array `X` and the kernel array `K` \n",
"corresponding to the values in the above diagram \n",
"to validate the output of the cross-correlation operation."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "2"
}
},
"outputs": [
{
"data": {
"text/plain": [
"\n",
"[[ 56. 72.]\n",
" [104. 120.]]\n",
""
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X = nd.array([[[0, 1, 2], [3, 4, 5], [6, 7, 8]],\n",
" [[1, 2, 3], [4, 5, 6], [7, 8, 9]]])\n",
"K = nd.array([[[0, 1], [2, 3]], [[1, 2], [3, 4]]])\n",
"\n",
"corr2d_multi_in(X, K)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Multiple Output Channels\n",
"\n",
"Regardless of the number of input channels, \n",
"so far we always ended up with one output channel. \n",
"However, as we discussed earlier, \n",
"it turns out to be essential to have multiple channels at each layer. \n",
"In the most popular neural network architectures, \n",
"we actually increase the channel dimension \n",
"as we go higher up in the neural network, \n",
"typically downsampling to trade off spatial resolution \n",
"for greater *channel depth*.\n",
"Intuitively, you could think of each channel \n",
"as responding to some different set of features. \n",
"Reality is a bit more complicated than the most naive intepretations of this intuition since representations aren't learned independent but are rather optimized to be jointly useful.\n",
"So it may not be that a single channel learns an edge detector but rather that some direction in channel space corresponds to detecting edges.\n",
"\n",
"\n",
"Denote by $c_i$ and $c_o$ the number \n",
"of input and output channels, respectively, \n",
"and let $k_h$ and $k_w$ be the height and width of the kernel. \n",
"To get an output with multiple channels, \n",
"we can create a kernel array \n",
"of shape $c_i\\times k_h\\times k_w$ \n",
"for each output channel. \n",
"We concatenate them on the output channel dimension,\n",
"so that the shape of the convolution kernel \n",
"is $c_o\\times c_i\\times k_h\\times k_w$. \n",
"In cross-correlation operations, \n",
"the result on each output channel is calculated \n",
"from the convolution kernel corrsponding to that output channel \n",
"and takes input from all channels in the input array.\n",
"\n",
"We implement a cross-correlation function \n",
"to calculate the output of multiple channels as shown below."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "3"
}
},
"outputs": [],
"source": [
"def corr2d_multi_in_out(X, K):\n",
" # Traverse along the 0th dimension of K, and each time, perform\n",
" # cross-correlation operations with input X. All of the results are merged\n",
" # together using the stack function\n",
" return nd.stack(*[corr2d_multi_in(X, k) for k in K])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We construct a convolution kernel with 3 output channels \n",
"by concatenating the kernel array `K` with `K+1` \n",
"(plus one for each element in `K`) and `K+2`."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "4"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(3, 2, 2, 2)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"K = nd.stack(K, K + 1, K + 2)\n",
"K.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below, we perform cross-correlation operations \n",
"on the input array `X` with the kernel array `K`. \n",
"Now the output contains 3 channels. \n",
"The result of the first channel is consistent \n",
"with the result of the previous input array `X` \n",
"and the multi-input channel, \n",
"single-output channel kernel."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "5"
}
},
"outputs": [
{
"data": {
"text/plain": [
"\n",
"[[[ 56. 72.]\n",
" [104. 120.]]\n",
"\n",
" [[ 76. 100.]\n",
" [148. 172.]]\n",
"\n",
" [[ 96. 128.]\n",
" [192. 224.]]]\n",
""
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"corr2d_multi_in_out(X, K)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## $1\\times 1$ Convolutional Layer\n",
"\n",
"At first, a $1 \\times 1$ convolution, i.e. $k_h = k_w = 1$, \n",
"doesn't seem to make much sense. \n",
"After all, a convolution correlates adjacent pixels. \n",
"A $1 \\times 1$ convolution obviously doesn't. \n",
"Nonetheless, they are popular operations that are sometimes included\n",
"in the designs of complex deep networks. \n",
"Let's see in some detail what it actually does.\n",
"\n",
"Because the minimum window is used, \n",
"the $1\\times 1$ convolution loses the ability \n",
"of larger convolutional layers \n",
"to recognize patterns consisting of interactions\n",
"among adjacent elements in the height and width dimensions. \n",
"The only computation of the $1\\times 1$ convolution occurs \n",
"on the channel dimension. \n",
"\n",
"The figure below shows the cross-correlation computation \n",
"using the $1\\times 1$ convolution kernel \n",
"with 3 input channels and 2 output channels. \n",
"Note that the inputs and outputs have the same height and width. \n",
"Each element in the output is derived \n",
"from a linear combination of elements *at the same position* \n",
"in the input image. \n",
"You could think of the $1\\times 1$ convolutional layer \n",
"as constituting a fully-connected layer applied at every single pixel location\n",
"to transform the c_i corresponding input values into c_o output values.\n",
"Because this is still a convolutional layer, \n",
"the weights are tied across pixel location \n",
"Thus the $1\\times 1$ convolutional layer requires $c_o\\times c_i$ weights\n",
"(plus the bias terms).\n",
"\n",
"\n",
"![The cross-correlation computation uses the $1\\times 1$ convolution kernel with 3 input channels and 2 output channels. The inputs and outputs have the same height and width. ](../img/conv_1x1.svg)\n",
"\n",
"Let's check whether this works in practice: \n",
"we implement the $1 \\times 1$ convolution \n",
"using a fully-connected layer. \n",
"The only thing is that we need to make some adjustments \n",
"to the data shape before and after the matrix multiplication."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "6"
}
},
"outputs": [],
"source": [
"def corr2d_multi_in_out_1x1(X, K):\n",
" c_i, h, w = X.shape\n",
" c_o = K.shape[0]\n",
" X = X.reshape((c_i, h * w))\n",
" K = K.reshape((c_o, c_i))\n",
" Y = nd.dot(K, X) # Matrix multiplication in the fully connected layer\n",
" return Y.reshape((c_o, h, w))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When performing $1\\times 1$ convolution, \n",
"the above function is equivalent to the previously implemented cross-correlation function `corr2d_multi_in_out`. \n",
"Let's check this with some reference data."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "7"
}
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X = nd.random.uniform(shape=(3, 3, 3))\n",
"K = nd.random.uniform(shape=(2, 3, 1, 1))\n",
"\n",
"Y1 = corr2d_multi_in_out_1x1(X, K)\n",
"Y2 = corr2d_multi_in_out(X, K)\n",
"\n",
"(Y1 - Y2).norm().asscalar() < 1e-6"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Summary\n",
"\n",
"* Multiple channels can be used to extend the model parameters of the convolutional layer.\n",
"* The $1\\times 1$ convolutional layer is equivalent to the fully-connected layer, when applied on a per pixel basis.\n",
"* The $1\\times 1$ convolutional layer is typically used to adjust the number of channels between network layers and to control model complexity.\n",
"\n",
"\n",
"## Exercises\n",
"\n",
"1. Assume that we have two convolutional kernels of size $k_1$ and $k_2$ respectively (with no nonlinearity in between).\n",
" * Prove that the result of the operation can be expressed by a single convolution.\n",
" * What is the dimensionality of the equivalent single convolution?\n",
" * Is the converse true?\n",
"1. Assume an input shape of $c_i\\times h\\times w$ and a convolution kernel with the shape $c_o\\times c_i\\times k_h\\times k_w$, padding of $(p_h, p_w)$, and stride of $(s_h, s_w)$.\n",
" * What is the computational cost (multiplications and additions) for the forward computation?\n",
" * What is the memory footprint?\n",
" * What is the memory footprint for the backward computation?\n",
" * What is the computational cost for the backward computation?\n",
"1. By what factor does the number of calculations increase if we double the number of input channels $c_i$ and the number of output channels $c_o$? What happens if we double the padding?\n",
"1. If the height and width of the convolution kernel is $k_h=k_w=1$, what is the complexity of the forward computation?\n",
"1. Are the variables `Y1` and `Y2` in the last example of this section exactly the same? Why?\n",
"1. How would you implement convolutions using matrix multiplication when the convolution window is not $1\\times 1$?\n",
"\n",
"## Scan the QR Code to [Discuss](https://discuss.mxnet.io/t/2351)\n",
"\n",
"![](../img/qr_channels.svg)"
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}