.. _sec_fcn:
Fully Convolutional Networks (FCN)
==================================
We previously discussed semantic segmentation, which predicts a category
for each pixel in an image. A fully convolutional network (FCN)
:cite:`Long.Shelhamer.Darrell.2015` uses a convolutional neural
network to transform image pixels to pixel categories. Unlike the
convolutional neural networks introduced previously, an FCN transforms
the height and width of the intermediate feature map back to the size of
the input image through a transposed convolution layer, so that the
predictions have a one-to-one correspondence with the input image in the
spatial dimensions (height and width). Given a position in the spatial
dimensions, the output along the channel dimension is the category
prediction for the pixel at that position.
We will first import the packages and modules needed for the experiment
and then explain the transposed convolution layer.
.. code:: python
%matplotlib inline
import d2l
from mxnet import gluon, image, init, np, npx
from mxnet.gluon import nn
npx.set_np()
Constructing a Model
--------------------
Here, we demonstrate the most basic design of a fully convolutional
network model. As shown in :numref:`fig_fcn`, the fully convolutional
network first uses a convolutional neural network to extract image
features, then transforms the number of channels into the number of
categories through a :math:`1\times 1` convolution layer, and finally
transforms the height and width of the feature map to the size of the
input image by using the transposed convolution layer introduced in
:numref:`sec_transposed_conv`. The model output has the same height
and width as the input image and has a one-to-one correspondence in
spatial positions. The channel dimension of the output holds the
category prediction for the pixel at the corresponding spatial position.
.. _fig_fcn:
.. figure:: ../img/fcn.svg
Fully convolutional network.
Below, we use a ResNet-18 model pre-trained on the ImageNet dataset to
extract image features and record the network instance as
``pretrained_net``. As you can see, the last two layers of the model
member variable ``features`` are the global average pooling layer
``GlobalAvgPool2D`` and the flattening layer ``Flatten``. The
``output`` module contains the fully connected layer used for output.
These layers are not required for a fully convolutional network.
.. code:: python
pretrained_net = gluon.model_zoo.vision.resnet18_v2(pretrained=True)
pretrained_net.features[-4:], pretrained_net.output
.. parsed-literal::
:class: output
(HybridSequential(
(0): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=512)
(1): Activation(relu)
(2): GlobalAvgPool2D(size=(1, 1), stride=(1, 1), padding=(0, 0), ceil_mode=True, global_pool=True, pool_type=avg, layout=NCHW)
(3): Flatten
),
Dense(512 -> 1000, linear))
Next, we create the fully convolutional network instance ``net``. It
copies all the layers of the member variable ``features`` of
``pretrained_net`` except the last two, together with the model
parameters obtained from pre-training.
.. code:: python

    net = nn.HybridSequential()
    for layer in pretrained_net.features[:-2]:
        net.add(layer)

Given an input with a height of 320 and a width of 480, the forward
computation of ``net`` reduces the height and width of the input to
:math:`1/32` of the original, i.e., 10 and 15.
.. code:: python
X = np.random.uniform(size=(1, 3, 320, 480))
net(X).shape
.. parsed-literal::
:class: output
(1, 512, 10, 15)
Next, we transform the number of output channels to the number of
categories of Pascal VOC2012 (21) through the :math:`1\times 1`
convolution layer. Finally, we need to magnify the height and width of
the feature map by a factor of 32 to change them back to the height and
width of the input image. Recall the calculation method for the
convolution layer output shape described in :numref:`sec_padding`.
Because :math:`(320-64+16\times2+32)/32=10` and
:math:`(480-64+16\times2+32)/32=15`, we construct a transposed
convolution layer with a stride of 32 and set the height and width of
the convolution kernel to 64 and the padding to 16. In general, if the
stride is :math:`s`, the padding is :math:`s/2` (assuming :math:`s/2`
is an integer), and the height and width of the convolution kernel are
:math:`2s`, the transposed convolution layer magnifies both the height
and width of the input by a factor of :math:`s`.
.. code:: python
num_classes = 21
net.add(nn.Conv2D(num_classes, kernel_size=1),
nn.Conv2DTranspose(
num_classes, kernel_size=64, padding=16, strides=32))
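The shape arithmetic above can be checked with a few lines of plain
Python. The helper below is a minimal sketch and not part of MXNet or
``d2l``: the name ``trans_conv_out_size`` is hypothetical, and it assumes
the usual transposed convolution output size :math:`(n-1)s - 2p + k`
with no output padding and no dilation.

.. code:: python

    def trans_conv_out_size(n, kernel_size, padding, stride):
        """Output length of a transposed convolution along one dimension.

        Hypothetical helper; assumes no output padding and no dilation.
        """
        return (n - 1) * stride - 2 * padding + kernel_size

    # The layer used in this section: kernel 64, padding 16, stride 32
    print(trans_conv_out_size(10, 64, 16, 32))  # 320
    print(trans_conv_out_size(15, 64, 16, 32))  # 480

    # General rule: stride s, padding s/2, kernel 2s magnifies by a factor of s
    for s in (2, 4, 8, 16, 32):
        assert trans_conv_out_size(7, 2 * s, s // 2, s) == 7 * s

Substituting :math:`k=2s` and :math:`p=s/2` gives
:math:`(n-1)s - s + 2s = ns`, which is why the rule holds for any input
size :math:`n`.
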
Initializing the Transposed Convolution Layer
---------------------------------------------
We already know that the transposed convolution layer can magnify a
feature map. In image processing, we sometimes need to magnify an
image, which is called upsampling. There are many methods for
upsampling, and one common method is bilinear interpolation. Simply
speaking, in order to get the pixel of the output image at the
coordinates :math:`(x, y)`, the coordinates are first mapped to the
coordinates :math:`(x', y')` of the input image. This can be done based
on the ratio of the size of the input to the size of the output. The
mapped values :math:`x'` and :math:`y'` are usually real numbers. Then,
we find the four pixels closest to the coordinates :math:`(x', y')` on
the input image. Finally, the pixel of the output image at coordinates
:math:`(x, y)` is calculated based on these four pixels of the input
image and their relative distances to :math:`(x', y')`. Upsampling by
bilinear interpolation can be implemented by a transposed convolution
layer whose convolution kernel is constructed using the following
``bilinear_kernel`` function. Due to space limitations, we only give
the implementation of the ``bilinear_kernel`` function and will not
discuss the principles of the algorithm.
.. code:: python

    def bilinear_kernel(in_channels, out_channels, kernel_size):
        factor = (kernel_size + 1) // 2
        # Center of the kernel in index coordinates
        if kernel_size % 2 == 1:
            center = factor - 1
        else:
            center = factor - 0.5
        og = (np.arange(kernel_size).reshape(-1, 1),
              np.arange(kernel_size).reshape(1, -1))
        # Weights decay linearly with the distance from the center
        filt = (1 - np.abs(og[0] - center) / factor) * \
               (1 - np.abs(og[1] - center) / factor)
        weight = np.zeros((in_channels, out_channels, kernel_size, kernel_size))
        # Channel i of the input is connected only to channel i of the output
        weight[range(in_channels), range(out_channels), :, :] = filt
        return np.array(weight)

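To get a feel for what ``bilinear_kernel`` produces, the following
sketch recomputes the one-dimensional weights in plain Python (the
two-dimensional filter is their outer product). The helper name
``bilinear_weights_1d`` is hypothetical and not part of ``d2l``; it only
mirrors the arithmetic of the function above.

.. code:: python

    def bilinear_weights_1d(kernel_size):
        """1D weights following the arithmetic of ``bilinear_kernel`` (sketch)."""
        factor = (kernel_size + 1) // 2
        center = factor - 1 if kernel_size % 2 == 1 else factor - 0.5
        return [1 - abs(i - center) / factor for i in range(kernel_size)]

    w = bilinear_weights_1d(4)
    print(w)  # [0.25, 0.75, 0.75, 0.25]

    # With a stride of 2, an output position away from the border receives
    # either the even-indexed or the odd-indexed weights. Both groups sum
    # to 1, so each output is a weighted average of neighboring inputs.
    print(sum(w[0::2]), sum(w[1::2]))  # 1.0 1.0
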
Now, we will experiment with bilinear interpolation upsampling
implemented by a transposed convolution layer. We construct a transposed
convolution layer that magnifies the height and width of the input by a
factor of 2 and initialize its convolution kernel with the
``bilinear_kernel`` function.
.. code:: python
conv_trans = nn.Conv2DTranspose(3, kernel_size=4, padding=1, strides=2)
conv_trans.initialize(init.Constant(bilinear_kernel(3, 3, 4)))
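The effect of such a layer can be reproduced in one dimension without
MXNet. The sketch below is a plain-Python illustration, not the
``Conv2DTranspose`` implementation: ``trans_conv_1d`` is a hypothetical
helper that scatters each input value through the kernel and then trims
the padding from both ends.

.. code:: python

    def trans_conv_1d(x, w, stride, padding):
        """Naive 1D transposed convolution (illustrative sketch)."""
        full = [0.0] * ((len(x) - 1) * stride + len(w))
        for i, xi in enumerate(x):
            for j, wj in enumerate(w):
                full[i * stride + j] += xi * wj
        # Trimming `padding` entries from both ends plays the role of padding
        return full[padding:len(full) - padding]

    w = [0.25, 0.75, 0.75, 0.25]  # 1D bilinear weights for kernel_size=4
    y = trans_conv_1d([1.0, 2.0, 3.0, 4.0], w, stride=2, padding=1)
    print(len(y))  # 8, twice the input length
    print(y[1:7])  # [1.25, 1.75, 2.25, 2.75, 3.25, 3.75]

The values away from the border lie on the straight line through the
inputs, as expected from linear interpolation. The last output is pulled
down because the kernel also covers the implicit zeros beyond the input.
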
Read the image ``X`` and record the result of upsampling as ``Y``. In
order to print the image, we need to adjust the position of the channel
dimension.
.. code:: python
img = image.imread('../img/catdog.jpg')
X = np.expand_dims(img.astype('float32').transpose(2, 0, 1), axis=0) / 255
Y = conv_trans(X)
out_img = Y[0].transpose(1, 2, 0)
As you can see, the transposed convolution layer magnifies both the
height and width of the image by a factor of 2. It is worth mentioning
that, apart from the difference in coordinate scale, the image magnified
by bilinear interpolation and the original image printed in
:numref:`sec_bbox` look the same.
.. code:: python
d2l.set_figsize((3.5, 2.5))
print('input image shape:', img.shape)
d2l.plt.imshow(img.asnumpy());
print('output image shape:', out_img.shape)
d2l.plt.imshow(out_img.asnumpy());
.. parsed-literal::
:class: output
input image shape: (561, 728, 3)
output image shape: (1122, 1456, 3)
.. figure:: output_fcn_8421ff_17_1.svg
In a fully convolutional network, we initialize the transposed
convolution layer with the kernel for bilinear interpolation upsampling.
For the :math:`1\times 1` convolution layer, we use Xavier random
initialization.
.. code:: python
W = bilinear_kernel(num_classes, num_classes, 64)
net[-1].initialize(init.Constant(W))
net[-2].initialize(init=init.Xavier())
Reading the Dataset
-------------------
We read the dataset using the method described in the previous section.
Here, we specify the shape of the randomly cropped output image as
:math:`320\times 480`, so that both the height and width are divisible
by 32.
.. code:: python
batch_size, crop_size = 32, (320, 480)
train_iter, test_iter = d2l.load_data_voc(batch_size, crop_size)
.. parsed-literal::
:class: output
Downloading ../data/VOCtrainval_11-May-2012.tar from http://d2l-data.s3-accelerate.amazonaws.com/VOCtrainval_11-May-2012.tar...
read 1114 examples
read 1078 examples
Training
--------
Now we can start training the model. The loss function and accuracy
calculation here are not substantially different from those used in
image classification. Because we use the channel dimension of the
transposed convolution layer output to predict pixel categories, the
``axis=1`` (channel dimension) option is specified in
``SoftmaxCrossEntropyLoss``. In addition, the accuracy is calculated
based on whether the predicted category of each pixel is correct.
.. code:: python
num_epochs, lr, wd, ctx = 5, 0.1, 1e-3, d2l.try_all_gpus()
loss = gluon.loss.SoftmaxCrossEntropyLoss(axis=1)
net.collect_params().reset_ctx(ctx)
trainer = gluon.Trainer(net.collect_params(), 'sgd',
{'learning_rate': lr, 'wd': wd})
d2l.train_ch13(net, train_iter, test_iter, loss, trainer, num_epochs, ctx)
.. parsed-literal::
:class: output
loss 0.315, train acc 0.896, test acc 0.852
301.5 examples/sec on [gpu(0), gpu(1)]
.. figure:: output_fcn_8421ff_23_1.svg
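To make the per-pixel accuracy concrete, here is a minimal plain-Python
sketch. It is not the code used by ``d2l.train_ch13``; the helper names
and the tiny two-category example are made up for illustration. Scores
are laid out as (channel, height, width), so the predicted category of a
pixel is the arg max over the channel dimension.

.. code:: python

    def predict_pixels(scores):
        """Arg max over the channel dimension of a (C, H, W) nested list."""
        num_classes, h, w = len(scores), len(scores[0]), len(scores[0][0])
        return [[max(range(num_classes), key=lambda c: scores[c][i][j])
                 for j in range(w)] for i in range(h)]

    def pixel_accuracy(pred, label):
        """Fraction of pixels whose predicted category matches the label."""
        pairs = [(p, t) for pr, lr in zip(pred, label) for p, t in zip(pr, lr)]
        return sum(p == t for p, t in pairs) / len(pairs)

    # Hypothetical scores for 2 categories on a 2x2 image
    scores = [[[0.9, 0.2], [0.4, 0.1]],   # category 0
              [[0.1, 0.8], [0.6, 0.9]]]   # category 1
    label = [[0, 1], [0, 1]]
    pred = predict_pixels(scores)
    print(pred)  # [[0, 1], [1, 1]]
    print(pixel_accuracy(pred, label))  # 0.75
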
Prediction
----------
During prediction, we need to standardize the input image in each
channel and transform it into the four-dimensional input format
required by the convolutional neural network.
.. code:: python

    def predict(img):
        X = test_iter._dataset.normalize_image(img)
        X = np.expand_dims(X.transpose(2, 0, 1), axis=0)
        pred = net(X.as_in_context(ctx[0])).argmax(axis=1)
        return pred.reshape(pred.shape[1], pred.shape[2])

To visualize the predicted categories for each pixel, we map the
predicted categories back to their labeled colors in the dataset.
.. code:: python

    def label2image(pred):
        colormap = np.array(d2l.VOC_COLORMAP, ctx=ctx[0], dtype='uint8')
        X = pred.astype('int32')
        return colormap[X, :]

The size and shape of the images in the test dataset vary. Because the
model uses a transposed convolution layer with a stride of 32, when the
height or width of the input image is not divisible by 32, the height or
width of the transposed convolution layer output deviates from the size
of the input image. In order to solve this problem, we can crop multiple
rectangular areas in the image with heights and widths as integer
multiples of 32, and then perform forward computation on the pixels in
these areas. When combined, these areas must completely cover the input
image. When a pixel is covered by multiple areas, the average of the
transposed convolution layer output in the forward computation of the
different areas can be used as an input for the softmax operation to
predict the category.
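One possible way to choose such areas along a single dimension is
sketched below in plain Python. This is an illustrative assumption, not
a procedure prescribed here: the hypothetical helper
``covering_offsets`` places fixed-size windows side by side and shifts
the last one back so that it ends at the image border, which creates the
overlap whose outputs would then be averaged.

.. code:: python

    def covering_offsets(length, window):
        """Start offsets of windows of size `window` that cover [0, length)."""
        if length < window:
            raise ValueError('the image must be at least as large as the window')
        offsets = list(range(0, length - window + 1, window))
        if offsets[-1] + window < length:
            # Add one more window that ends exactly at the border
            offsets.append(length - window)
        return offsets

    def coverage_counts(length, window):
        """How many windows cover each position (the divisor for averaging)."""
        counts = [0] * length
        for start in covering_offsets(length, window):
            for i in range(start, start + window):
                counts[i] += 1
        return counts

    print(covering_offsets(500, 480))  # [0, 20]
    print(covering_offsets(375, 320))  # [0, 55]
    counts = coverage_counts(500, 480)
    print(min(counts), max(counts))  # 1 2

Applying the same helper to the height and to the width gives a grid of
rectangular areas; the window sizes 320 and 480 used here are multiples
of 32, as required.
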
For the sake of simplicity, we only read a few large test images and
crop an area with a shape of :math:`320\times480` from the top-left
corner of the image. Only this area is used for prediction. For the
input image, we print the cropped area first, then print the predicted
result, and finally print the labeled category.
.. code:: python

    voc_dir = d2l.download_extract('voc2012', 'VOCdevkit/VOC2012')
    test_images, test_labels = d2l.read_voc_images(voc_dir, False)
    n, imgs = 4, []
    for i in range(n):
        # Crop rectangle given as (x, y, width, height)
        crop_rect = (0, 0, 480, 320)
        X = image.fixed_crop(test_images[i], *crop_rect)
        pred = label2image(predict(X))
        imgs += [X, pred, image.fixed_crop(test_labels[i], *crop_rect)]
    d2l.show_images(imgs[::3] + imgs[1::3] + imgs[2::3], 3, n, scale=2);

.. figure:: output_fcn_8421ff_29_0.svg
Summary
-------
- The fully convolutional network first uses a convolutional neural
  network to extract image features, then transforms the number of
  channels into the number of categories through a :math:`1\times 1`
  convolution layer, and finally transforms the height and width of the
  feature map to the size of the input image by using a transposed
  convolution layer to output the category of each pixel.
- In a fully convolutional network, we initialize the transposed
  convolution layer with the kernel for bilinear interpolation
  upsampling.
Exercises
---------
1. If we use Xavier to randomly initialize the transposed convolution
   layer, what will happen to the result?
2. Can you further improve the accuracy of the model by tuning the
   hyperparameters?
3. Predict the categories of all pixels in the test image.
4. The outputs of some intermediate layers of the convolutional neural
   network are also used in the paper on fully convolutional networks
   :cite:`Long.Shelhamer.Darrell.2015`. Try to implement this idea.