From Dense Layers to Convolutions
=================================
The models that we have discussed so far are fine options if you are
dealing with *tabular* data. By *tabular* we mean that the data consists
of rows corresponding to examples and columns corresponding to features.
With tabular data, we might anticipate that pattern we seek could
require modeling interactions among the features, but do not assume
anything a priori about which features are related to each other or in
what way.
Sometimes we truly may not have any knowledge to guide the construction
of more cleverly-organized architectures. In these cases, a multilayer
perceptron is often the best that we can do. However, once we start
dealing with high-dimensional perceptual data, these *structure-less*
networks can grow unwieldy.
For instance, let’s return to our running example of distinguishing cats
from dogs. Say that we do a thorough job in data collection, collecting
an annotated sets of high-quality 1-megapixel photographs. This means
that the input into a network has *1 million dimensions*. Even an
aggressive reduction to *1,000 hidden dimensions* would require a
*dense* (fully-connected) layer to support :math:`10^9` parameters.
Unless we have an extremely large dataset (perhaps billions?), lots of
GPUs, a talent for extreme distributed optimization, and an
extraordinary amount of patience, learning the parameters of this
network may turn out to be impossible.
A careful reader might object to this argument on the basis that 1
megapixel resolution may not be necessary. However, while you could get
away with 100,000 pixels, we grossly underestimated the number of hidden
nodes that it typically takes to learn good hidden representations of
images. Learning a binary classifier with so many parameters might seem
to require that we collect an enormous dataset, perhaps comparable to
the number of dogs and cats on the planet. And yet both humans and
computers are able to distinguish cats from dogs quite well, seemingly
contradicting these conclusions. That is because images exhibit rich
structure that is typically exploited by humans and machine learning
models alike.
Invariances
-----------
Imagine that you want to detect an object in an image. It seems
reasonable that whatever method we use to recognize objects should not
be overly concerned with the precise *location* of the object shouldn’t
in the image. Ideally we could learn a system that would somehow exploit
this knowledge. Pigs usually do not fly and planes usually do not swim.
Nonetheless, we could still recognize a flying pig were one to appear.
This ideas is taken to an extreme in the children’s game ‘Where’s
Waldo’, an example is shown in :numref:`img_waldo`. The game consists
of a number of chaotic scenes bursting with activity and Waldo shows up
somewhere in each (typically lurking in some unlikely location). The
reader’s goal is to locate him. Despite his characteristic outfit, this
can be surprisingly difficult, due to the large number of confounders.
.. _img_waldo:
.. figure:: ../img/where-wally-walker-books.jpg
:width: 400px
Image via Walker Books
Back to images, the intuitions we have been discussing could be made
more concrete yielding a few key principles for building neural networks
for computer vision:
1. Our vision systems should, in some sense, respond similarly to the
same object regardless of where it appears in the image (translation
invariance).
2. Our visions systems should, in some sense, focus on local regions,
without regard for what else is happening in the image at greater
distances (locality).
Let’s see how this translates into mathematics.
Constraining the MLP
--------------------
.. raw:: html
To start off let’s consider what an MLP would look like with
:math:`h \times w` images as inputs (represented as matrices in math,
and as 2D arrays in code), and hidden representations similarly
organized as :math:`h \times w` matrices / 2D arrays. Let
:math:`x[i, j]` and :math:`h[i, j]` denote pixel location :math:`(i, j)`
in an image and hidden representation, respectively. Consequently, to
have each of the :math:`hw` hidden nodes receive input from each of the
:math:`hw` inputs, we would switch from using weight matrices (as we did
previously in MLPs) to representing our parameters as four-dimensional
weight tensors.
We could formally express this dense layer as follows:
.. math::
h[i, j] = u[i, j] + \sum_{k, l} W[i, j, k, l] \cdot x[k, l] = u[i, j] +
\sum_{a, b} V[i, j, a, b] \cdot x[i+a, j+b].
The switch from :math:`W` to :math:`V` is entirely cosmetic (for now)
since there is a one-to-one correspondence between coefficients in both
tensors. We simply re-index the subscripts :math:`(k, l)` such that
:math:`k = i+a` and :math:`l = j+b`. In other words, we set
:math:`V[i, j, a, b] = W[i, j, i+a, j+b]`. The indices :math:`a, b` run
over both positive and negative offsets, covering the entire image. For
any given location :math:`(i, j)` in the hidden layer :math:`h[i, j]`,
we compute its value by summing over pixels in :math:`x`, centered
around :math:`(i, j)` and weighted by :math:`V[i, j, a, b]`.
Now let’s invoke the first principle we established above: *translation
invariance*. This implies that a shift in the inputs :math:`x` should
simply lead to a shift in the activations :math:`h`. This is only
possible if :math:`V` and :math:`u` do not actually depend on
:math:`(i, j)`, i.e., we have :math:`V[i, j, a, b] = V[a, b]` and
:math:`u` is a constant. As a result we can simplify the definition for
:math:`h`.
.. math:: h[i, j] = u + \sum_{a, b} V[a, b] \cdot x[i+a, j+b].
This is a convolution! We are effectively weighting pixels
:math:`(i+a, j+b)` in the vicinity of :math:`(i, j)` with coefficients
:math:`V[a, b]` to obtain the value :math:`h[i, j]`. Note that
:math:`V[a, b]` needs many fewer coefficients than
:math:`V[i, j, a, b]`. For a 1 megapixel image it has at most 1 million
coefficients. This is 1 million fewer parameters since it no longer
depends on the location within the image. We have made significant
progress!
Now let’s invoke the second principle—\ *locality*. As motivated above,
we believe that we shouldn’t have to look very far away from
:math:`(i, j)` in order to glean relevant information to assess what is
going on at :math:`h[i, j]`. This means that outside some range
:math:`|a|, |b| > \Delta`, we should set :math:`V[a, b] = 0`.
Equivalently, we can rewrite :math:`h[i, j]` as
.. math:: h[i, j] = u + \sum_{a = -\Delta}^{\Delta} \sum_{b = -\Delta}^{\Delta} V[a, b] \cdot x[i+a, j+b].
This, in a nutshell is the convolutional layer. When the local region
(also called a *receptive field*) is small, the difference as compared
to a fully-connected network can be dramatic. While previously, we might
have required billions of parameters to represent just a single layer in
an image-processing network, we now typically need just a few hundred.
The price that we pay for this drastic modification is that our features
will be translation invariant and that our layer can only take local
information into account. All learning depends on imposing inductive
bias. When that bias agrees with reality, we get sample-efficient models
that generalize well to unseen data. But of course, if those biases do
not agree with reality, e.g., if images turned out not to be translation
invariant, our models may not generalize well.
Convolutions
------------
Let’s briefly review why the above operation is called a *convolution*.
In mathematics, the convolution between two functions, say
:math:`f, g: \mathbb{R}^d \to R` is defined as
.. math:: [f \circledast g](x) = \int_{\mathbb{R}^d} f(z) g(x-z) dz.
That is, we measure the overlap between :math:`f` and :math:`g` when
both functions are shifted by :math:`x` and “flipped”. Whenever we have
discrete objects, the integral turns into a sum. For instance, for
vectors defined on :math:`\ell_2`, i.e., the set of square summable
infinite dimensional vectors with index running over :math:`\mathbb{Z}`
we obtain the following definition.
.. math:: [f \circledast g](i) = \sum_a f(a) g(i-a).
For two-dimensional arrays, we have a corresponding sum with indices
:math:`(i, j)` for :math:`f` and :math:`(i-a, j-b)` for :math:`g`
respectively. This looks similar to definition above, with one major
difference. Rather than using :math:`(i+a, j+b)`, we are using the
difference instead. Note, though, that this distinction is mostly
cosmetic since we can always match the notation by using
:math:`\tilde{V}[a, b] = V[-a, -b]` to obtain
:math:`h = x \circledast \tilde{V}`. Also note that the original
definition is actually a *cross correlation*. We will come back to this
in the following section.
Waldo Revisited
---------------
Let’s see what this looks like if we want to build an improved Waldo
detector. The convolutional layer picks windows of a given size and
weighs intensities according to the mask :math:`V`, as demonstrated in
:numref:`fig_waldo_mask`. We expect that wherever the “waldoness” is
highest, we will also find a peak in the hidden layer activations.
.. _fig_waldo_mask:
.. figure:: ../img/waldo-mask.jpg
:width: 400px
Find Waldo.
There is just a problem with this approach: so far we blissfully ignored
that images consist of 3 channels: red, green and blue. In reality,
images are quite two-dimensional objects but rather as a
:math:`3^{\mathrm{rd}}` order tensor, e.g., with shape
:math:`1024 \times 1024 \times 3` pixels. Only two of these axes concern
spatial relationships, while the :math:`3^{\mathrm{rd}}` can be regarded
as assigning a multidimensional representation *to each pixel location*.
We thus index :math:`\mathbf{x}` as :math:`x[i, j, k]`. The
convolutional mask has to adapt accordingly. Instead of :math:`V[a, b]`
we now have :math:`V[a, b, c]`.
Moreover, just as our input consists of a :math:`3^{\mathrm{rd}}` order
tensor it turns out to be a good idea to similarly formulate our hidden
representations as :math:`3^{\mathrm{rd}}` order tensors. In other
words, rather than just having a 1D representation corresponding to each
spatial location, we want to have a multidimensional hidden
representations corresponding to each spatial location. We could think
of the hidden representation as comprising a number of 2D grids stacked
on top of each other. These are sometimes called *channels* or *feature
maps*. Intuitively you might imagine that at lower layers, some channels
specialize to recognizing edges, We can take care of this by adding a
fourth coordinate to :math:`V` via :math:`V[a, b, c, d]`. Putting all
together we have:
.. math:: h[i, j, k] = \sum_{a = -\Delta}^{\Delta} \sum_{b = -\Delta}^{\Delta} \sum_c V[a, b, c, k] \cdot x[i+a, j+b, c].
This is the definition of a convolutional neural network layer. There
are still many operations that we need to address. For instance, we need
to figure out how to combine all the activations to a single output
(e.g., whether there is a Waldo in the image). We also need to decide
how to compute things efficiently, how to combine multiple layers, and
whether it is a good idea to have many narrow or a few wide layers. All
of this will be addressed in the remainder of the chapter.
Summary
-------
- Translation invariance in images implies that all patches of an image
will be treated in the same manner.
- Locality means that only a small neighborhood of pixels will be used
for computation.
- Channels on input and output allows for meaningful feature analysis.
Exercises
---------
1. Assume that the size of the convolution mask is :math:`\Delta = 0`.
Show that in this case the convolutional mask implements an MLP
independently for each set of channels.
2. Why might translation invariance not be a good idea after all? Does
it make sense for pigs to fly?
3. What happens at the boundary of an image?
4. Derive an analogous convolutional layer for audio.
5. What goes wrong when you apply the above reasoning to text? Hint:
what is the structure of language?
6. Prove that :math:`f \circledast g = g \circledast f`.
`Discussions `__
-------------------------------------------------
|image0|
.. |image0| image:: ../img/qr_why-conv.svg