# 3.2. Linear Regression Implementation from Scratch¶

After getting some background on linear regression, we are now ready for
a hands-on implementation. While a powerful deep learning framework
minimizes repetitive work, relying on it too much to make things easy
can make it hard to properly understand how deep learning works. This
matters in particular if we want to change things later, e.g. define our
own layers, loss functions, etc. Because of this, we start by describing
how to implement linear regression training using only NDArray and
`autograd`

.

Before we begin, let’s import the package or module required for this
section’s experiment; `matplotlib`

will be used for plotting and will
be set to embed in the GUI.

```
In [1]:
```

```
%matplotlib inline
from IPython import display
from matplotlib import pyplot as plt
from mxnet import autograd, nd
import random
```

## 3.2.1. Generating Data Sets¶

By constructing a simple artificial training data set, we can visually compare the differences between the parameters we have learned and the actual model parameters. Set the number of examples in the training data set as 1000 and the number of inputs (feature number) as 2. Using the randomly generated batch example feature \(\mathbf{X}\in \mathbb{R}^{1000 \times 2}\), we use the actual weight \(\mathbf{w} = [2, -3.4]^\top\) and bias \(b = 4.2\) of the linear regression model, as well as a random noise item \(\epsilon\) to generate the tag

The noise term \(\epsilon\) (or rather each coordinate of it) obeys a normal distribution with a mean of 0 and a standard deviation of 0.01. To get a better idea, let us generate the dataset.

```
In [2]:
```

```
num_inputs = 2
num_examples = 1000
true_w = nd.array([2, -3.4])
true_b = 4.2
features = nd.random.normal(scale=1, shape=(num_examples, num_inputs))
labels = nd.dot(features, true_w) + true_b
labels += nd.random.normal(scale=0.01, shape=labels.shape)
```

Note that each row in `features`

consists of a 2-dimensional data
point and that each row in `labels`

consists of a 1-dimensional target
value (a scalar).

```
In [3]:
```

```
features[0], labels[0]
```

```
Out[3]:
```

```
(
[2.2122064 0.7740038]
<NDArray 2 @cpu(0)>,
[6.000587]
<NDArray 1 @cpu(0)>)
```

By generating a scatter plot using the second `features[:, 1]`

and
`labels`

, we can clearly observe the linear correlation between the
two.

```
In [4]:
```

```
def use_svg_display():
# Display in vector graphics
display.set_matplotlib_formats('svg')
def set_figsize(figsize=(3.5, 2.5)):
use_svg_display()
# Set the size of the graph to be plotted
plt.rcParams['figure.figsize'] = figsize
set_figsize()
plt.figure(figsize=(10, 6))
plt.scatter(features[:, 1].asnumpy(), labels.asnumpy(), 1);
```

The plotting function `plt`

as well as the `use_svg_display`

and
`set_figsize`

functions are defined in the `d2l`

package. We will
call `d2l.plt`

directly for future plotting. To print the vector
diagram and set its size, we only need to call `d2l.set_figsize()`

before plotting, because `plt`

is a global variable in the `d2l`

package.

## 3.2.2. Reading Data¶

We need to iterate over the entire data set and continuously examine
mini-batches of data examples when training the model. Here we define a
function. Its purpose is to return the features and tags of random
`batch_size`

(batch size) examples every time it’s called. One might
wonder why we are not reading one observation at a time but rather
construct an iterator which returns a few observations at a time. This
has mostly to do with efficiency when optimizing. Recall that when we
processed one dimension at a time the algorithm was quite slow. The same
thing happens when processing single observations vs. an entire ‘batch’
of them, which can be represented as a matrix rather than just a vector.
In particular, GPUs are much faster when it comes to dealing with
matrices, up to an order of magnitude. This is one of the reasons why
deep learning usually operates on mini-batches rather than singletons.

```
In [5]:
```

```
# This function has been saved in the d2l package for future use
def data_iter(batch_size, features, labels):
num_examples = len(features)
indices = list(range(num_examples))
# The examples are read at random, in no particular order
random.shuffle(indices)
for i in range(0, num_examples, batch_size):
j = nd.array(indices[i: min(i + batch_size, num_examples)])
yield features.take(j), labels.take(j)
# The “take” function will then return the corresponding element based
# on the indices
```

Let’s read and print the first small batch of data examples. The shape of the features in each batch corresponds to the batch size and the number of input dimensions. Likewise, we obtain as many labels as requested by the batch size.

```
In [6]:
```

```
batch_size = 10
for X, y in data_iter(batch_size, features, labels):
print(X, y)
break
```

```
[[ 0.74844444 -2.7521858 ]
[-1.5366337 -1.015936 ]
[ 0.73648626 0.5670942 ]
[-0.5451464 1.0648695 ]
[-0.1461925 -0.3182019 ]
[-0.38307324 0.60577285]
[ 1.1499298 0.47734314]
[ 1.0132014 0.9695031 ]
[-0.4295941 -1.3485998 ]
[-0.13386123 0.3524519 ]]
<NDArray 10x2 @cpu(0)>
[15.052703 4.5778785 3.73271 -0.5462227 4.9912653 1.3707838
4.8553643 2.954539 7.912276 2.7298915]
<NDArray 10 @cpu(0)>
```

Clearly, if we run the iterator again, we obtain a different minibatch until all the data has been exhausted (try this). Note that the iterator described above is a bit inefficient (it requires that we load all data in memory and that we perform a lot of random memory access). The built-in iterators are more efficient and they can deal with data stored on file (or being fed via a data stream).

## 3.2.3. Initialize Model Parameters¶

Weights are initialized to normal random numbers using a mean of 0 and a standard deviation of 0.01, with the bias \(b\) set to zero.

```
In [7]:
```

```
w = nd.random.normal(scale=0.01, shape=(num_inputs, 1))
b = nd.zeros(shape=(1,))
```

In the succeeding cells, we’re going to update these parameters to
better fit our data. This will involve taking the gradient (a
multi-dimensional derivative) of some loss function with respect to the
parameters. We’ll update each parameter in the direction that reduces
the loss. In order for `autograd`

to know that it needs to set up the
appropriate data structures, track changes, etc., we need to attach
gradients explicitly.

```
In [8]:
```

```
w.attach_grad()
b.attach_grad()
```

## 3.2.4. Define the Model¶

Next we’ll want to define our model. In this case, we’ll be working with linear models, the simplest possible useful neural network. To calculate the output of the linear model, we simply multiply a given input with the model’s weights \(w\), and add the offset \(b\).

```
In [9]:
```

```
# This function has been saved in the d2l package for future use
def linreg(X, w, b):
return nd.dot(X, w) + b
```

## 3.2.5. Define the Loss Function¶

We will use the squared loss function described in the previous section
to define the linear regression loss. In the implementation, we need to
transform the true value `y`

into the predicted value’s shape
`y_hat`

. The result returned by the following function will also be
the same as the `y_hat`

shape.

```
In [10]:
```

```
# This function has been saved in the d2l package for future use
def squared_loss(y_hat, y):
return (y_hat - y.reshape(y_hat.shape)) ** 2 / 2
```

## 3.2.6. Define the Optimization Algorithm¶

Linear regression actually has a closed-form solution. However, most
interesting models that we’ll care about cannot be solved analytically.
So we’ll solve this problem by stochastic gradient descent `sgd`

. At
each step, we’ll estimate the gradient of the loss with respect to our
weights, using one batch randomly drawn from our dataset. Then, we’ll
update our parameters a small amount in the direction that reduces the
loss. Here, the gradient calculated by the automatic differentiation
module is the gradient sum of a batch of examples. We divide it by the
batch size to obtain the average. The size of the step is determined by
the learning rate `lr`

.

```
In [11]:
```

```
# This function has been saved in the d2l package for future use
def sgd(params, lr, batch_size):
for param in params:
param[:] = param - lr * param.grad / batch_size
```

## 3.2.7. Training¶

In training, we will iterate over the data to improve the model
parameters. In each iteration, the mini-batch stochastic gradient is
calculated by first calling the inverse function `backward`

depending
on the currently read mini-batch data examples (feature `X`

and label
`y`

), and then calling the optimization algorithm `sgd`

to iterate
the model parameters. Since we previously set the batch size
`batch_size`

to 10, the loss shape `l`

for each small batch is (10,
1).

- Initialize parameters \((\mathbf{w}, b)\)
- Repeat until done
- Compute gradient \(\mathbf{g} \leftarrow \partial_{(\mathbf{w},b)} \frac{1}{\mathcal{B}} \sum_{i \in \mathcal{B}} l(\mathbf{x}^i, y^i, \mathbf{w}, b)\)
- Update parameters \((\mathbf{w}, b) \leftarrow (\mathbf{w}, b) - \eta \mathbf{g}\)

Since nobody wants to compute gradients explicitly (this is tedious and
error prone) we use automatic differentiation to compute \(g\). See
section “Automatic Gradient”
for more details. Since the loss `l`

is not a scalar variable, running
`l.backward()`

will add together the elements in `l`

to obtain the
new variable, and then calculate the variable model parameters’
gradient.

In an epoch (a pass through the data), we will iterate through the
`data_iter`

function once and use it for all the examples in the
training data set (assuming the number of examples is divisible by the
batch size). The number of epochs `num_epochs`

and the learning rate
`lr`

are both hyper-parameters and are set to 3 and 0.03,
respectively. Unfortunately in practice, the majority of the
hyper-parameters will require some adjustment by trial and error. For
instance, the model might actually become more accurate by training
longer (but this increases computational cost). Likewise, we might want
to change the learning rate on the fly. We will discuss this later in
the chapter on “Optimization
Algorithms”.

```
In [12]:
```

```
lr = 0.03 # Learning rate
num_epochs = 3 # Number of iterations
net = linreg # Our fancy linear model
loss = squared_loss # 0.5 (y-y')^2
for epoch in range(num_epochs):
# Assuming the number of examples can be divided by the batch size, all
# the examples in the training data set are used once in one epoch
# iteration. The features and tags of mini-batch examples are given by X
# and y respectively
for X, y in data_iter(batch_size, features, labels):
with autograd.record():
l = loss(net(X, w, b), y) # Minibatch loss in X and y
l.backward() # Compute gradient on l with respect to [w,b]
sgd([w, b], lr, batch_size) # Update parameters using their gradient
train_l = loss(net(features, w, b), labels)
print('epoch %d, loss %f' % (epoch + 1, train_l.mean().asnumpy()))
```

```
epoch 1, loss 0.040925
epoch 2, loss 0.000158
epoch 3, loss 0.000050
```

To evaluate the trained model, we can compare the actual parameters used with the parameters we have learned after the training has been completed. They are very close to each other.

```
In [13]:
```

```
print('Error in estimating w', true_w - w.reshape(true_w.shape))
print('Error in estimating b', true_b - b)
```

```
Error in estimating w
[0.00045097 0.00021887]
<NDArray 2 @cpu(0)>
Error in estimating b
[0.00038624]
<NDArray 1 @cpu(0)>
```

Note that we should not take it for granted that we are able to reover
the parameters accurately. This only happens for a special category
problems: strongly convex optimization problems with ‘enough’ data to
ensure that the noisy samples allow us to recover the underlying
dependency correctly. In most cases this is *not* the case. In fact, the
parameters of a deep network are rarely the same (or even close) between
two different runs, unless everything is kept identically, including the
order in which the data is traversed. Nonetheless this can lead to very
good solutions, mostly due to the fact that quite often there are many
sets of parameters that work well.

## 3.2.8. Summary¶

We saw how a deep network can be implemented and optimized from scratch,
using just NDArray and `autograd`

without any need for defining
layers, fancy optimizers, etc. This only scratches the surface of what
is possible. In the following sections, we will describe additional deep
learning models based on what we have just learned and you will learn
how to implement them using more concisely.

## 3.2.9. Exercises¶

- What would happen if we were to initialize the weights \(\mathbf{w} = 0\). Would the algorithm still work?
- Assume that you’re Georg Simon
Ohm trying to come up
with a model between voltage and current. Can you use
`autograd`

to learn the parameters of your model. - Can you use Planck’s Law to determine the temperature of an object using spectral energy density.
- What are the problems you might encounter if you wanted to extend
`autograd`

to second derivatives? How would you fix them? - Why is the
`reshape`

function needed in the`squared_loss`

function? - Experiment using different learning rates to find out how fast the loss function value drops.
- If the number of examples cannot be divided by the batch size, what
happens to the
`data_iter`

function’s behavior?