{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Mini-batch Stochastic Gradient Descent\n",
"\n",
"In each iteration, gradient descent uses the entire training data set to compute the gradient, so it is sometimes referred to as batch gradient descent. Stochastic gradient descent (SGD), in contrast, randomly selects only one example per iteration to compute the gradient. Just like in the previous chapters, we can perform random uniform sampling in each iteration to form a mini-batch and then use this mini-batch to compute the gradient. Now, we are going to discuss mini-batch stochastic gradient descent.\n",
"\n",
"Let the objective function be $f(\\boldsymbol{x}): \\mathbb{R}^d \\rightarrow \\mathbb{R}$. The time step before the start of iteration is set to 0, and the independent variable at this time step is $\\boldsymbol{x}_0\\in \\mathbb{R}^d$, usually obtained by random initialization. In each subsequent time step $t>0$, mini-batch SGD uses random uniform sampling to obtain a mini-batch $\\mathcal{B}_t$ of example indices from the training data set. The sampling can be done with or without replacement: the former allows duplicate examples in the same mini-batch, while the latter does not and is more commonly used. With either of the two methods, we can use\n",
"\n",
"$$\\boldsymbol{g}_t \\leftarrow \\nabla f_{\\mathcal{B}_t}(\\boldsymbol{x}_{t-1}) = \\frac{1}{|\\mathcal{B}|} \\sum_{i \\in \\mathcal{B}_t}\\nabla f_i(\\boldsymbol{x}_{t-1})$$\n",
"\n",
"to compute the gradient $\\boldsymbol{g}_t$ of the objective function at $\\boldsymbol{x}_{t-1}$ with mini-batch $\\mathcal{B}_t$ at time step $t$. Here, the batch size $|\\mathcal{B}|$, the number of examples in the mini-batch, is a hyper-parameter. Just like the stochastic gradient, the mini-batch stochastic gradient $\\boldsymbol{g}_t$ obtained by sampling with replacement is an unbiased estimate of the gradient $\\nabla f(\\boldsymbol{x}_{t-1})$. Given a positive learning rate $\\eta_t$, mini-batch SGD updates the independent variable as follows:\n",
"\n",
"$$\\boldsymbol{x}_t \\leftarrow \\boldsymbol{x}_{t-1} - \\eta_t \\boldsymbol{g}_t.$$\n",
"\n",
"The variance of a gradient based on random sampling cannot be reduced during the iterative process, so in practice the learning rate of (mini-batch) SGD is usually decayed during the iteration, for example as $\\eta_t=\\eta t^\\alpha$ (usually $\\alpha=-1$ or $-0.5$), as $\\eta_t = \\eta \\alpha^t$ (e.g. $\\alpha=0.95$), or by reducing the learning rate once per iteration or after several iterations. As a result, the variance of the product of the learning rate and the (mini-batch) stochastic gradient will decrease. Gradient descent, by contrast, uses the true gradient of the objective function throughout the iteration and does not need a decaying learning rate.\n",
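"\n",
"As a small illustration, the two decay schedules can be written out directly (the base rate and the constants below are assumed values, not taken from the text):\n",
"\n",
"```python\n",
"# Two common learning-rate decay schedules (illustrative values)\n",
"eta = 0.1  # assumed base learning rate\n",
"\n",
"# Polynomial decay: eta_t = eta * t ** alpha, with alpha = -0.5\n",
"poly = [eta * t ** -0.5 for t in range(1, 6)]\n",
"\n",
"# Exponential decay: eta_t = eta * alpha ** t, with alpha = 0.95\n",
"expo = [eta * 0.95 ** t for t in range(1, 6)]\n",
"\n",
"# Both sequences decrease monotonically toward zero\n",
"print(poly)\n",
"print(expo)\n",
"```\n",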
"\n",
"\n",
"The cost of computing each iteration is $\\mathcal{O}(|\\mathcal{B}|)$. When the batch size is 1, the algorithm is SGD; when the batch size equals the number of examples in the training data set, the algorithm is gradient descent. When the batch size is small, fewer examples are used in each iteration, which results in insufficient parallelization and lower RAM usage efficiency, making it more time-consuming to process the same number of examples than with larger batches. As the batch size increases, each mini-batch gradient may contain more redundant information, so to get a good solution we may need to compute more examples with a larger batch size, for instance by increasing the number of epochs.\n",
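"\n",
"The update rule above can be sketched in plain NumPy on a toy least-squares objective (all data and names below are illustrative, not part of this chapter's code):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Toy least-squares problem with known solution x* = (1, ..., 1)\n",
"rng = np.random.default_rng(0)\n",
"n, d = 100, 3\n",
"A = rng.standard_normal((n, d))\n",
"b = A @ np.ones(d)\n",
"\n",
"def grad_batch(x, idx):\n",
"    # Average gradient over the sampled mini-batch B_t\n",
"    return A[idx].T @ (A[idx] @ x - b[idx]) / len(idx)\n",
"\n",
"x, eta, batch_size = np.zeros(d), 0.1, 10\n",
"for t in range(200):\n",
"    # Sample a mini-batch of indices without replacement\n",
"    idx = rng.choice(n, batch_size, replace=False)\n",
"    x -= eta * grad_batch(x, idx)  # x_t <- x_{t-1} - eta_t * g_t\n",
"\n",
"print(np.allclose(x, np.ones(d), atol=1e-2))  # converges to x*\n",
"```\n",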
"\n",
"\n",
"## Reading Data\n",
"\n",
"In this chapter, we will compare these optimization algorithms using a data set developed by NASA to test the wing noise of different aircraft [1]. We will use the first 1,500 examples of the data set and its 5 features, and standardize the data as a preprocessing step."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"#!pip install matplotlib"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "1"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(1500, 5)"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import sys\n",
"sys.path.insert(0, '..')\n",
"\n",
"%matplotlib inline\n",
"import d2l\n",
"from mxnet import autograd, gluon, init, nd\n",
"from mxnet.gluon import nn, data as gdata, loss as gloss\n",
"import numpy as np\n",
"import time\n",
"\n",
"# This function is saved in the d2l package for future use\n",
"def get_data_ch7():\n",
" data = np.genfromtxt('../data/airfoil_self_noise.dat', delimiter='\\t')\n",
" data = (data - data.mean(axis=0)) / data.std(axis=0)\n",
" return nd.array(data[:1500, :-1]), nd.array(data[:1500, -1])\n",
"\n",
"features, labels = get_data_ch7()\n",
"features.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Implementation from Scratch\n",
"\n",
"We have already implemented the mini-batch SGD algorithm in the [Linear Regression Implemented From Scratch](../chapter_deep-learning-basics/linear-regression-scratch.md) section. Here, we make its input parameters more generic so that we can conveniently use the same inputs for the other optimization algorithms introduced later in this chapter. Specifically, we add the state input `states` and place the hyper-parameters in the dictionary `hyperparams`. In addition, the training function will average the loss over each mini-batch, so the gradient in the optimization algorithm does not need to be divided by the batch size."
]
},
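{
"cell_type": "markdown",
"metadata": {},
"source": [
"The effect of averaging the loss can be checked numerically: for a squared loss, the gradient of the averaged mini-batch loss equals the sum of per-example gradients divided by the batch size. (The toy numbers below are illustrative, not from this chapter's data.)\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Per-example squared loss: 0.5 * (w * x_i - y_i) ** 2\n",
"x = np.array([1.0, 2.0, 3.0])\n",
"y = np.array([2.0, 4.0, 6.0])\n",
"w = 1.0\n",
"\n",
"# Per-example gradient with respect to w\n",
"per_example = (w * x - y) * x\n",
"\n",
"grad_of_sum = per_example.sum()    # gradient of the summed loss\n",
"grad_of_mean = per_example.mean()  # gradient of the averaged loss\n",
"\n",
"# grad_of_mean == grad_of_sum / batch_size\n",
"print(grad_of_sum, grad_of_mean)\n",
"```"
]
},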
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "2"
}
},
"outputs": [],
"source": [
"def sgd(params, states, hyperparams):\n",
" for p in params:\n",
" p[:] -= hyperparams['lr'] * p.grad"
]
},
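{
"cell_type": "markdown",
"metadata": {},
"source": [
"The in-place update performed by `sgd` can be sketched with NumPy stand-ins (the `Param` class below is a hypothetical holder mimicking an NDArray with an attached `.grad`; it is not part of MXNet or the d2l package):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"class Param:\n",
"    # Hypothetical stand-in for a parameter array with an attached gradient\n",
"    def __init__(self, value, grad):\n",
"        self.value = np.asarray(value, dtype=float)\n",
"        self.grad = np.asarray(grad, dtype=float)\n",
"\n",
"def sgd(params, states, hyperparams):\n",
"    # Same update rule as above: p <- p - lr * p.grad, done in place\n",
"    for p in params:\n",
"        p.value -= hyperparams['lr'] * p.grad\n",
"\n",
"w = Param([1.0, 2.0], grad=[0.5, -0.5])\n",
"sgd([w], None, {'lr': 0.1})\n",
"print(w.value)  # [0.95 2.05]\n",
"```"
]
},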
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we implement a generic training function to facilitate the use of the other optimization algorithms introduced later in this chapter. It initializes a linear regression model and can then be used to train the model with mini-batch SGD as well as the other algorithms introduced in subsequent sections."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "29"
}
},
"outputs": [],
"source": [
"# This function is saved in the d2l package for future use\n",
"def train_ch7(trainer_fn, states, hyperparams, features, labels,\n",
" batch_size=10, num_epochs=2):\n",
" # Initialize model parameters\n",
" net, loss = d2l.linreg, d2l.squared_loss\n",
" w = nd.random.normal(scale=0.01, shape=(features.shape[1], 1))\n",
" b = nd.zeros(1)\n",
" w.attach_grad()\n",
" b.attach_grad()\n",
"\n",
" def eval_loss():\n",
" return loss(net(features, w, b), labels).mean().asscalar()\n",
"\n",
" ls, ts = [eval_loss()], [0,]\n",
" data_iter = gdata.DataLoader(\n",
" gdata.ArrayDataset(features, labels), batch_size, shuffle=True)\n",
" start = time.time()\n",
" for _ in range(num_epochs):\n",
" for batch_i, (X, y) in enumerate(data_iter):\n",
" with autograd.record():\n",
" l = loss(net(X, w, b), y).mean() # Average the loss\n",
" l.backward()\n",
" # Update model parameters\n",
" trainer_fn([w, b], states, hyperparams)\n",
" if (batch_i + 1) * batch_size % 10 == 0:\n",
" # Record the current training error for every 10 examples\n",
" ts.append(time.time() - start + ts[-1])\n",
" ls.append(eval_loss())\n",
" start = time.time()\n",
" \n",
" # Print and plot the results.\n",
" print('loss: %f, %f sec per epoch' % (ls[-1], ts[-1]/num_epochs))\n",
" d2l.set_figsize()\n",
" d2l.plt.plot(np.linspace(0, num_epochs, len(ls)), ls)\n",
" d2l.plt.xlabel('epoch')\n",
" d2l.plt.ylabel('loss')\n",
" return ts, ls"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When the batch size equals 1500 (the total number of examples), we are using gradient descent for optimization, and the model parameters are updated only once per epoch. As we can see, the downward trend of the value of the objective function (training loss) flattens out after 6 iterations."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "30"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"loss: 0.246095, 0.027745 sec per epoch\n"
]
},
{
"data": {
"text/plain": [
"