{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Fully Convolutional Networks (FCN)\n",
"\n",
"We previously discussed semantic segmentation, which makes a category prediction for each pixel in an image. A fully convolutional network (FCN) uses a convolutional neural network to transform image pixels into pixel categories. Unlike the convolutional neural networks previously introduced, an FCN transforms the height and width of the intermediate layer feature maps back to the size of the input image through transposed convolution layers, so that the predictions have a one-to-one correspondence with the input image in the spatial dimensions (height and width). Given a spatial position, the output in the channel dimension is the category prediction for the pixel at that position.\n",
"\n",
"We will first import the packages and modules needed for the experiment and then explain the transposed convolution layer."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "2"
}
},
"outputs": [],
"source": [
"import sys\n",
"sys.path.insert(0, '..')\n",
"\n",
"%matplotlib inline\n",
"import d2l\n",
"from mxnet import gluon, image, init, nd\n",
"from mxnet.gluon import data as gdata, loss as gloss, model_zoo, nn\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Transposed Convolution Layer\n",
"\n",
"The transposed convolution layer takes its name from the matrix transposition operation. In fact, convolution operations can also be achieved by matrix multiplication. In the example below, we define an input `X` with a height and width of 4, and a convolution kernel `K` with a height and width of 3. We print the output of the 2D convolution operation together with the convolution kernel. As you can see, the output has a height and width of 2."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(\n",
" [[[[348. 393.]\n",
" [528. 573.]]]]\n",
" , \n",
" [[[[1. 2. 3.]\n",
" [4. 5. 6.]\n",
" [7. 8. 9.]]]]\n",
" )"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X = nd.arange(1, 17).reshape((1, 1, 4, 4))\n",
"K = nd.arange(1, 10).reshape((1, 1, 3, 3))\n",
"conv = nn.Conv2D(channels=1, kernel_size=3)\n",
"conv.initialize(init.Constant(K))\n",
"conv(X), K"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we rewrite the convolution kernel `K` as a sparse weight matrix `W` containing a large number of zero elements. The shape of the weight matrix is (4, 16), and its non-zero elements come from the elements of `K`. We concatenate the rows of the input `X` to get a vector of length 16, then multiply `W` by this vector to obtain a vector of length 4. Reshaping this vector gives the same result as the convolution operation above. Thus, in this example, we have implemented the convolution operation using matrix multiplication."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(\n",
" [[[[348. 393.]\n",
" [528. 573.]]]]\n",
" , \n",
" [[1. 2. 3. 0. 4. 5. 6. 0. 7. 8. 9. 0. 0. 0. 0. 0.]\n",
" [0. 1. 2. 3. 0. 4. 5. 6. 0. 7. 8. 9. 0. 0. 0. 0.]\n",
" [0. 0. 0. 0. 1. 2. 3. 0. 4. 5. 6. 0. 7. 8. 9. 0.]\n",
" [0. 0. 0. 0. 0. 1. 2. 3. 0. 4. 5. 6. 0. 7. 8. 9.]]\n",
" )"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"W, k = nd.zeros((4, 16)), nd.zeros(11)\n",
"k[:3], k[4:7], k[8:] = K[0, 0, 0, :], K[0, 0, 1, :], K[0, 0, 2, :]\n",
"W[0, 0:11], W[1, 1:12], W[2, 4:15], W[3, 5:16] = k, k, k, k\n",
"nd.dot(W, X.reshape(16)).reshape((1, 1, 2, 2)), W"
]
},
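{
"cell_type": "markdown",
"metadata": {},
"source": [
"The reverse mapping can be sketched with the same matrix: multiplying the length-4 output by $\\boldsymbol{W}^\\top$ maps it back to a vector of length 16, i.e. back to the input shape. A minimal NumPy illustration (the variable names mirror the cell above and are otherwise arbitrary):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Rebuild the (4, 16) weight matrix W from the 3x3 kernel, as above\n",
"K = np.arange(1, 10).reshape(3, 3)\n",
"k = np.zeros(11)\n",
"k[:3], k[4:7], k[8:] = K[0], K[1], K[2]\n",
"W = np.zeros((4, 16))\n",
"W[0, 0:11], W[1, 1:12], W[2, 4:15], W[3, 5:16] = k, k, k, k\n",
"\n",
"x = np.arange(1, 17.0)  # flattened 4x4 input\n",
"y = W @ x               # forward convolution: length-4 output\n",
"z = W.T @ y             # multiplication by W^T: back to length 16\n",
"print(y.reshape(2, 2))  # same values as the convolution output above\n",
"print(z.shape)          # (16,)\n",
"```"
]
},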
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we will describe the convolution operation from the perspective of matrix multiplication. Let the input vector be $\\boldsymbol{x}$ and the weight matrix be $\\boldsymbol{W}$. The forward computation of the convolution can then be seen as multiplying the input by the weight matrix to output the vector $\\boldsymbol{y} = \\boldsymbol{W}\\boldsymbol{x}$. We know that back propagation is based on the chain rule. Because $\\nabla_{\\boldsymbol{x}} \\boldsymbol{y} = \\boldsymbol{W}^\\top$, the back propagation of the convolution can be seen as multiplying its input by the transposed weight matrix $\\boldsymbol{W}^\\top$. The transposed convolution layer exchanges the forward computation function and the back propagation function of the convolution layer: these two functions multiply their input vectors by $\\boldsymbol{W}^\\top$ and $\\boldsymbol{W}$, respectively.\n",
"\n",
"It is not difficult to see that the transposed convolution layer can exchange the input and output shapes of a convolution layer. Continuing with the matrix-multiplication view, let the weight matrix have a shape of $4\\times16$. For an input vector of length 16, the convolution forward computation outputs a vector of length 4. If the length of the input vector is 4 and the shape of the transposed weight matrix is $16\\times4$, then the transposed convolution layer outputs a vector of length 16. In model design, transposed convolution layers are often used to transform smaller feature maps into larger ones. In a fully convolutional network, when the input is a feature map with a small height and width, the transposed convolution layer can magnify the height and width to the size of the input image.\n",
"\n",
"Now we will look at an example. We construct a convolution layer `conv` and let the shape of the input `X` be (1, 3, 64, 64). The convolution output `Y` increases the number of channels to 10, while the height and width are halved."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "3"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(1, 10, 32, 32)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"conv = nn.Conv2D(10, kernel_size=4, padding=1, strides=2)\n",
"conv.initialize()\n",
"\n",
"X = nd.random.uniform(shape=(1, 3, 64, 64))\n",
"Y = conv(X)\n",
"Y.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we construct a transposed convolution layer `conv_trans` by creating a `Conv2DTranspose` instance. Here, we assume that the kernel shape, padding, and stride of `conv_trans` are the same as those of `conv`, and that the number of output channels is 3. When the input is the output `Y` of the convolution layer `conv`, the output of the transposed convolution layer has the same height and width as the input of the convolution layer: the transposed convolution layer magnifies the height and width of the feature map by a factor of 2."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "4"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(1, 3, 64, 64)"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"conv_trans = nn.Conv2DTranspose(3, kernel_size=4, padding=1, strides=2)\n",
"conv_trans.initialize()\n",
"conv_trans(Y).shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the literature, transposed convolution is also sometimes referred to as fractionally-strided convolution [2].\n",
"\n",
"\n",
"## Construct a Model\n",
"\n",
"Here, we demonstrate the most basic design of a fully convolutional network model. As shown in Figure 9.11, the fully convolutional network first uses a convolutional neural network to extract image features, then transforms the number of channels into the number of categories through a $1\\times 1$ convolution layer, and finally transforms the height and width of the feature map to the size of the input image through a transposed convolution layer. The model output has the same height and width as the input image, with a one-to-one correspondence in spatial positions: the output in the channel dimension contains the category predictions for the pixel at the corresponding spatial position.\n",
"\n",
"![Fully convolutional network. ](../img/fcn.svg)\n",
"\n",
"Below, we use a ResNet-18 model pre-trained on the ImageNet data set to extract image features, and record the network instance as `pretrained_net`. As you can see, the last two layers of the model member variable `features` are the global average pooling layer `GlobalAvgPool2D` and the flattening layer `Flatten`, while the `output` module contains the fully connected layer used for output. These layers are not required for a fully convolutional network."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "5"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(HybridSequential(\n",
" (0): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=512)\n",
" (1): Activation(relu)\n",
" (2): GlobalAvgPool2D(size=(1, 1), stride=(1, 1), padding=(0, 0), ceil_mode=True)\n",
" (3): Flatten\n",
" ), Dense(512 -> 1000, linear))"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pretrained_net = model_zoo.vision.resnet18_v2(pretrained=True)\n",
"pretrained_net.features[-4:], pretrained_net.output"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we create the fully convolutional network instance `net`. It duplicates all the layers of the `features` member variable of `pretrained_net` except the last two, together with the model parameters obtained through pre-training."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "6"
}
},
"outputs": [],
"source": [
"net = nn.HybridSequential()\n",
"for layer in pretrained_net.features[:-2]:\n",
" net.add(layer)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Given an input with a height of 320 and a width of 480, the forward computation of `net` reduces the height and width of the input to $1/32$ of the original, i.e. 10 and 15."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "7"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(1, 512, 10, 15)"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X = nd.random.uniform(shape=(1, 3, 320, 480))\n",
"net(X).shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we transform the number of output channels to the number of categories of Pascal VOC2012 (21) through a $1\\times 1$ convolution layer. Finally, we need to magnify the height and width of the feature map by a factor of 32 to change them back to the height and width of the input image. Recall the calculation method for the convolution layer output shape described in the section [\"Padding and Stride\"](../chapter_convolutional-neural-networks/padding-and-strides.md). Because $(320-64+16\\times2+32)/32=10$ and $(480-64+16\\times2+32)/32=15$, we construct a transposed convolution layer with a stride of 32, and set the height and width of the convolution kernel to 64 and the padding to 16. It is not difficult to see that if the stride is $s$, the padding is $s/2$ (assuming $s/2$ is an integer), and the height and width of the convolution kernel are $2s$, then the transposed convolution kernel magnifies both the height and width of the input by a factor of $s$."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "8"
}
},
"outputs": [],
"source": [
"num_classes = 21\n",
"net.add(nn.Conv2D(num_classes, kernel_size=1),\n",
" nn.Conv2DTranspose(num_classes, kernel_size=64, padding=16,\n",
" strides=32))"
]
},
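{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check of this arithmetic: the output size of a transposed convolution layer (without output padding) is $s(n-1)+k-2p$ for input size $n$, kernel size $k$, padding $p$, and stride $s$, i.e. the inverse of the convolution shape formula. A small sketch with the numbers used above (the helper name is our own):\n",
"\n",
"```python\n",
"def trans_conv_out(n, k, p, s):\n",
"    # Inverse of the convolution output-shape formula (no output padding)\n",
"    return s * (n - 1) + k - 2 * p\n",
"\n",
"print(trans_conv_out(10, 64, 16, 32))  # height: 10 -> 320\n",
"print(trans_conv_out(15, 64, 16, 32))  # width:  15 -> 480\n",
"```"
]
},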
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initialize the Transposed Convolution Layer\n",
"\n",
"We already know that the transposed convolution layer can magnify a feature map. In image processing, we sometimes need to magnify an image, i.e. upsampling. There are many methods for upsampling, and one common method is bilinear interpolation. Simply speaking, in order to get the pixel of the output image at the coordinates $(x, y)$, the coordinates are first mapped to the coordinates $(x', y')$ of the input image. This can be done based on the ratio of the size of the input to the size of the output. The mapped values $x'$ and $y'$ are usually real numbers. Then, we find the four pixels closest to the coordinate $(x', y')$ on the input image. Finally, the pixel of the output image at coordinates $(x, y)$ is calculated based on these four pixels of the input image and their relative distances to $(x', y')$. Upsampling by bilinear interpolation can be implemented by a transposed convolution layer whose convolution kernel is constructed using the `bilinear_kernel` function below. Due to space limitations, we only give the implementation of the `bilinear_kernel` function and will not discuss the principles of the algorithm."
]
},
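{
"cell_type": "markdown",
"metadata": {},
"source": [
"The procedure just described can be sketched directly in NumPy (a toy single-channel version; the function name and the pixel-center coordinate convention are our own choices, not part of the original implementation):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def bilinear_upsample(img, scale):\n",
"    # Map each output pixel to fractional input coordinates and blend\n",
"    # the four nearest input pixels by their relative distances\n",
"    h, w = img.shape\n",
"    out = np.empty((h * scale, w * scale))\n",
"    for y in range(h * scale):\n",
"        for x in range(w * scale):\n",
"            yi = min(max((y + 0.5) / scale - 0.5, 0), h - 1)\n",
"            xi = min(max((x + 0.5) / scale - 0.5, 0), w - 1)\n",
"            y0, x0 = int(yi), int(xi)\n",
"            y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)\n",
"            dy, dx = yi - y0, xi - x0\n",
"            out[y, x] = (img[y0, x0] * (1 - dy) * (1 - dx)\n",
"                         + img[y0, x1] * (1 - dy) * dx\n",
"                         + img[y1, x0] * dy * (1 - dx)\n",
"                         + img[y1, x1] * dy * dx)\n",
"    return out\n",
"\n",
"print(bilinear_upsample(np.array([[0.0, 1.0], [2.0, 3.0]]), 2))\n",
"```"
]
},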
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "9"
}
},
"outputs": [],
"source": [
"def bilinear_kernel(in_channels, out_channels, kernel_size):\n",
"    # Center of the kernel: between pixels for even kernel sizes\n",
"    factor = (kernel_size + 1) // 2\n",
"    if kernel_size % 2 == 1:\n",
"        center = factor - 1\n",
"    else:\n",
"        center = factor - 0.5\n",
"    og = np.ogrid[:kernel_size, :kernel_size]\n",
"    # Weight decreases linearly with the distance from the center\n",
"    filt = (1 - abs(og[0] - center) / factor) * \\\n",
"           (1 - abs(og[1] - center) / factor)\n",
"    weight = np.zeros((in_channels, out_channels, kernel_size, kernel_size),\n",
"                      dtype='float32')\n",
"    # Each channel is interpolated independently of the others\n",
"    weight[range(in_channels), range(out_channels), :, :] = filt\n",
"    return nd.array(weight)"
]
},
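{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, with `kernel_size=4`, `factor` is 2 and `center` is 1.5, so each axis receives the weights (0.25, 0.75, 0.75, 0.25) and the kernel is their outer product. A standalone NumPy check of this filter (recomputing the same expression, independent of MXNet):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"kernel_size, factor, center = 4, 2, 1.5\n",
"og = np.ogrid[:kernel_size, :kernel_size]\n",
"filt = ((1 - abs(og[0] - center) / factor)\n",
"        * (1 - abs(og[1] - center) / factor))\n",
"print(filt[0])     # 0.25 * (0.25, 0.75, 0.75, 0.25)\n",
"print(filt.sum())  # 4.0, i.e. stride**2: total intensity is preserved\n",
"```"
]
},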
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we will experiment with bilinear interpolation upsampling implemented by transposed convolution layers. We construct a transposed convolution layer that magnifies the height and width of the input by a factor of 2 and initialize its convolution kernel with the `bilinear_kernel` function."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "11"
}
},
"outputs": [],
"source": [
"conv_trans = nn.Conv2DTranspose(3, kernel_size=4, padding=1, strides=2)\n",
"conv_trans.initialize(init.Constant(bilinear_kernel(3, 3, 4)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Read the image, transform it into the input `X`, and record the result of upsampling as `Y`. To print the image, we need to adjust the position of the channel dimension."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"img = image.imread('../img/catdog.jpg')\n",
"X = img.astype('float32').transpose((2, 0, 1)).expand_dims(axis=0) / 255\n",
"Y = conv_trans(X)\n",
"out_img = Y[0].transpose((1, 2, 0))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see, the transposed convolution layer magnifies both the height and width of the image by a factor of 2. It is worth mentioning that, apart from the difference in coordinate scale, the image magnified by bilinear interpolation looks the same as the original image printed in the [\"Object Detection and Bounding Box\"](bounding-box.md) section."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"input image shape: (561, 728, 3)\n",
"output image shape: (1122, 1456, 3)\n"
]
},
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
"