# 6.1. From Dense Layers to Convolutions¶ Open the notebook in Colab

The models that we have discussed so far are fine options if you are
dealing with *tabular* data. By *tabular* we mean that the data consists
of rows corresponding to examples and columns corresponding to features.
With tabular data, we might anticipate that pattern we seek could
require modeling interactions among the features, but do not assume
anything a priori about which features are related to each other or in
what way.

Sometimes we truly may not have any knowledge to guide the construction
of more cleverly-organized architectures. In these cases, a multilayer
perceptron is often the best that we can do. However, once we start
dealing with high-dimensional perceptual data, these *structure-less*
networks can grow unwieldy.

For instance, let us return to our running example of distinguishing
cats from dogs. Say that we do a thorough job in data collection,
collecting an annotated sets of high-quality 1-megapixel photographs.
This means that the input into a network has *1 million dimensions*.
Even an aggressive reduction to *1,000 hidden dimensions* would require
a *dense* (fully-connected) layer to support \(10^9\) parameters.
Unless we have an extremely large dataset (perhaps billions?), lots of
GPUs, a talent for extreme distributed optimization, and an
extraordinary amount of patience, learning the parameters of this
network may turn out to be impossible.

A careful reader might object to this argument on the basis that 1 megapixel resolution may not be necessary. However, while you could get away with 100,000 pixels, we grossly underestimated the number of hidden nodes that it typically takes to learn good hidden representations of images. Learning a binary classifier with so many parameters might seem to require that we collect an enormous dataset, perhaps comparable to the number of dogs and cats on the planet. And yet both humans and computers are able to distinguish cats from dogs quite well, seemingly contradicting these conclusions. That is because images exhibit rich structure that is typically exploited by humans and machine learning models alike.

## 6.1.1. Invariances¶

Imagine that you want to detect an object in an image. It seems
reasonable that whatever method we use to recognize objects should not
be overly concerned with the precise *location* of the object in the
image. Ideally we could learn a system that would somehow exploit this
knowledge. Pigs usually do not fly and planes usually do not swim.
Nonetheless, we could still recognize a flying pig were one to appear.
This ideas is taken to an extreme in the children’s game ‘Where’s
Waldo’, an example is shown in Fig. 6.1.1. The game consists
of a number of chaotic scenes bursting with activity and Waldo shows up
somewhere in each (typically lurking in some unlikely location). The
reader’s goal is to locate him. Despite his characteristic outfit, this
can be surprisingly difficult, due to the large number of confounders.

Back to images, the intuitions we have been discussing could be made more concrete yielding a few key principles for building neural networks for computer vision:

Our vision systems should, in some sense, respond similarly to the same object regardless of where it appears in the image (translation invariance).

Our visions systems should, in some sense, focus on local regions, without regard for what else is happening in the image at greater distances (locality).

Let us see how this translates into mathematics.

## 6.1.2. Constraining the MLP¶

To start off let us consider what an MLP would look like with \(h \times w\) images as inputs (represented as matrices in math, and as 2D arrays in code), and hidden representations similarly organized as \(h \times w\) matrices / 2D arrays. Let \(x[i, j]\) and \(h[i, j]\) denote pixel location \((i, j)\) in an image and hidden representation, respectively. Consequently, to have each of the \(hw\) hidden nodes receive input from each of the \(hw\) inputs, we would switch from using weight matrices (as we did previously in MLPs) to representing our parameters as four-dimensional weight tensors.

We could formally express this dense layer as follows:

The switch from \(W\) to \(V\) is entirely cosmetic (for now) since there is a one-to-one correspondence between coefficients in both tensors. We simply re-index the subscripts \((k, l)\) such that \(k = i+a\) and \(l = j+b\). In other words, we set \(V[i, j, a, b] = W[i, j, i+a, j+b]\). The indices \(a, b\) run over both positive and negative offsets, covering the entire image. For any given location \((i, j)\) in the hidden layer \(h[i, j]\), we compute its value by summing over pixels in \(x\), centered around \((i, j)\) and weighted by \(V[i, j, a, b]\).

Now let us invoke the first principle we established above: *translation
invariance*. This implies that a shift in the inputs \(x\) should
simply lead to a shift in the activations \(h\). This is only
possible if \(V\) and \(u\) do not actually depend on
\((i, j)\), i.e., we have \(V[i, j, a, b] = V[a, b]\) and
\(u\) is a constant. As a result we can simplify the definition for
\(h\).

This is a convolution! We are effectively weighting pixels \((i+a, j+b)\) in the vicinity of \((i, j)\) with coefficients \(V[a, b]\) to obtain the value \(h[i, j]\). Note that \(V[a, b]\) needs many fewer coefficients than \(V[i, j, a, b]\). For a 1 megapixel image it has at most 1 million coefficients. This is 1 million fewer parameters since it no longer depends on the location within the image. We have made significant progress!

Now let us invoke the second principle—*locality*. As motivated above,
we believe that we should not have to look very far away from
\((i, j)\) in order to glean relevant information to assess what is
going on at \(h[i, j]\). This means that outside some range
\(|a|, |b| > \Delta\), we should set \(V[a, b] = 0\).
Equivalently, we can rewrite \(h[i, j]\) as

This, in a nutshell is the convolutional layer. When the local region
(also called a *receptive field*) is small, the difference as compared
to a fully-connected network can be dramatic. While previously, we might
have required billions of parameters to represent just a single layer in
an image-processing network, we now typically need just a few hundred.
The price that we pay for this drastic modification is that our features
will be translation invariant and that our layer can only take local
information into account. All learning depends on imposing inductive
bias. When that bias agrees with reality, we get sample-efficient models
that generalize well to unseen data. But of course, if those biases do
not agree with reality, e.g., if images turned out not to be translation
invariant, our models may not generalize well.

## 6.1.3. Convolutions¶

Let us briefly review why the above operation is called a *convolution*.
In mathematics, the convolution between two functions, say
\(f, g: \mathbb{R}^d \to R\) is defined as

That is, we measure the overlap between \(f\) and \(g\) when both functions are shifted by \(x\) and “flipped”. Whenever we have discrete objects, the integral turns into a sum. For instance, for vectors defined on \(\ell_2\), i.e., the set of square summable infinite dimensional vectors with index running over \(\mathbb{Z}\) we obtain the following definition.

For two-dimensional arrays, we have a corresponding sum with indices
\((i, j)\) for \(f\) and \((i-a, j-b)\) for \(g\)
respectively. This looks similar to definition above, with one major
difference. Rather than using \((i+a, j+b)\), we are using the
difference instead. Note, though, that this distinction is mostly
cosmetic since we can always match the notation by using
\(\tilde{V}[a, b] = V[-a, -b]\) to obtain
\(h = x \circledast \tilde{V}\). Also note that the original
definition is actually a *cross correlation*. We will come back to this
in the following section.

## 6.1.4. Waldo Revisited¶

Let us see what this looks like if we want to build an improved Waldo detector. The convolutional layer picks windows of a given size and weighs intensities according to the mask \(V\), as demonstrated in Fig. 6.1.2. We expect that wherever the “waldoness” is highest, we will also find a peak in the hidden layer activations.

There is just a problem with this approach: so far we blissfully ignored
that images consist of 3 channels: red, green and blue. In reality,
images are quite two-dimensional objects but rather as a
\(3^{\mathrm{rd}}\) order tensor, e.g., with shape
\(1024 \times 1024 \times 3\) pixels. Only two of these axes concern
spatial relationships, while the \(3^{\mathrm{rd}}\) can be regarded
as assigning a multidimensional representation *to each pixel location*.

We thus index \(\mathbf{x}\) as \(x[i, j, k]\). The convolutional mask has to adapt accordingly. Instead of \(V[a, b]\) we now have \(V[a, b, c]\).

Moreover, just as our input consists of a \(3^{\mathrm{rd}}\) order
tensor it turns out to be a good idea to similarly formulate our hidden
representations as \(3^{\mathrm{rd}}\) order tensors. In other
words, rather than just having a 1D representation corresponding to each
spatial location, we want to have a multidimensional hidden
representations corresponding to each spatial location. We could think
of the hidden representation as comprising a number of 2D grids stacked
on top of each other. These are sometimes called *channels* or *feature
maps*. Intuitively you might imagine that at lower layers, some channels
specialize to recognizing edges, We can take care of this by adding a
fourth coordinate to \(V\) via \(V[a, b, c, d]\). Putting all
together we have:

This is the definition of a convolutional neural network layer. There are still many operations that we need to address. For instance, we need to figure out how to combine all the activations to a single output (e.g., whether there is a Waldo in the image). We also need to decide how to compute things efficiently, how to combine multiple layers, and whether it is a good idea to have many narrow or a few wide layers. All of this will be addressed in the remainder of the chapter.

## 6.1.5. Summary¶

Translation invariance in images implies that all patches of an image will be treated in the same manner.

Locality means that only a small neighborhood of pixels will be used for computation.

Channels on input and output allows for meaningful feature analysis.

## 6.1.6. Exercises¶

Assume that the size of the convolution mask is \(\Delta = 0\). Show that in this case the convolutional mask implements an MLP independently for each set of channels.

Why might translation invariance not be a good idea after all? Does it make sense for pigs to fly?

What happens at the boundary of an image?

Derive an analogous convolutional layer for audio.

What goes wrong when you apply the above reasoning to text? Hint: what is the structure of language?

Prove that \(f \circledast g = g \circledast f\).