{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Adam\n",
"\n",
"Created on the basis of RMSProp, Adam also uses EWMA on the mini-batch stochastic gradient[1]. Here, we are going to introduce this algorithm.\n",
"\n",
"## The Algorithm\n",
"\n",
"Adam uses the momentum variable $\\boldsymbol{v}_t$ and variable $\\boldsymbol{s}_t$, which is an EWMA on the squares of elements in the mini-batch stochastic gradient from RMSProp, and initializes each element of the variables to 0 at time step 0. Given the hyperparameter $0 \\leq \\beta_1 < 1$ (the author of the algorithm suggests a value of 0.9), the momentum variable $\\boldsymbol{v}_t$ at time step $t$ is the EWMA of the mini-batch stochastic gradient $\\boldsymbol{g}_t$:\n",
"\n",
"$$\\boldsymbol{v}_t \\leftarrow \\beta_1 \\boldsymbol{v}_{t-1} + (1 - \\beta_1) \\boldsymbol{g}_t. $$\n",
"\n",
"Just as in RMSProp, given the hyperparameter $0 \\leq \\beta_2 < 1$ (the author of the algorithm suggests a value of 0.999),\n",
"After taken the squares of elements in the mini-batch stochastic gradient, find $\\boldsymbol{g}_t \\odot \\boldsymbol{g}_t$ and perform EWMA on it to obtain $\\boldsymbol{s}_t$:\n",
"\n",
"$$\\boldsymbol{s}_t \\leftarrow \\beta_2 \\boldsymbol{s}_{t-1} + (1 - \\beta_2) \\boldsymbol{g}_t \\odot \\boldsymbol{g}_t. $$\n",
"\n",
"Since we initialized elements in $\\boldsymbol{v}_0$ and $\\boldsymbol{s}_0$ to 0,\n",
"we get $\\boldsymbol{v}_t = (1-\\beta_1) \\sum_{i=1}^t \\beta_1^{t-i} \\boldsymbol{g}_i$ at time step $t$. Sum the mini-batch stochastic gradient weights from each previous time step to get $(1-\\beta_1) \\sum_{i=1}^t \\beta_1^{t-i} = 1 - \\beta_1^t$. Notice that when $t$ is small, the sum of the mini-batch stochastic gradient weights from each previous time step will be small. For example, when $\\beta_1 = 0.9$, $\\boldsymbol{v}_1 = 0.1\\boldsymbol{g}_1$. To eliminate this effect, for any time step $t$, we can divide $\\boldsymbol{v}_t$ by $1 - \\beta_1^t$, so that the sum of the mini-batch stochastic gradient weights from each previous time step is 1. This is also called bias correction. In the Adam algorithm, we perform bias corrections for variables $\\boldsymbol{v}_t$ and $\\boldsymbol{s}_t$:\n",
"\n",
"$$\\hat{\\boldsymbol{v}}_t \\leftarrow \\frac{\\boldsymbol{v}_t}{1 - \\beta_1^t}, $$\n",
"\n",
"$$\\hat{\\boldsymbol{s}}_t \\leftarrow \\frac{\\boldsymbol{s}_t}{1 - \\beta_2^t}. $$\n",
"\n",
"\n",
"Next, the Adam algorithm will use the bias-corrected variables $\\hat{\\boldsymbol{v}}_t$ and $\\hat{\\boldsymbol{s}}_t$ from above to re-adjust the learning rate of each element in the model parameters using element operations.\n",
"\n",
"$$\\boldsymbol{g}_t' \\leftarrow \\frac{\\eta \\hat{\\boldsymbol{v}}_t}{\\sqrt{\\hat{\\boldsymbol{s}}_t} + \\epsilon},$$\n",
"\n",
"Here, $\\eta$ is the learning rate while $\\epsilon$ is a constant added to maintain numerical stability, such as $10^{-8}$. Just as for Adagrad, RMSProp, and Adadelta, each element in the independent variable of the objective function has its own learning rate. Finally, use $\\boldsymbol{g}_t'$ to iterate the independent variable:\n",
"\n",
"$$\\boldsymbol{x}_t \\leftarrow \\boldsymbol{x}_{t-1} - \\boldsymbol{g}_t'. $$\n",
"\n",
"## Implementation from Scratch\n",
"\n",
"We use the formula from the algorithm to implement Adam. Here, time step $t$ uses `hyperparams` to input parameters to the `adam` function."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "2"
}
},
"outputs": [],
"source": [
"import sys\n",
"sys.path.insert(0, '..')\n",
"\n",
"%matplotlib inline\n",
"import d2l\n",
"from mxnet import nd\n",
"\n",
"features, labels = d2l.get_data_ch7()\n",
"\n",
"def init_adam_states():\n",
" v_w, v_b = nd.zeros((features.shape[1], 1)), nd.zeros(1)\n",
" s_w, s_b = nd.zeros((features.shape[1], 1)), nd.zeros(1)\n",
" return ((v_w, s_w), (v_b, s_b))\n",
"\n",
"def adam(params, states, hyperparams):\n",
" beta1, beta2, eps = 0.9, 0.999, 1e-6\n",
" for p, (v, s) in zip(params, states):\n",
" v[:] = beta1 * v + (1 - beta1) * p.grad\n",
" s[:] = beta2 * s + (1 - beta2) * p.grad.square()\n",
" v_bias_corr = v / (1 - beta1 ** hyperparams['t'])\n",
" s_bias_corr = s / (1 - beta2 ** hyperparams['t'])\n",
" p[:] -= hyperparams['lr'] * v_bias_corr / (s_bias_corr.sqrt() + eps)\n",
" hyperparams['t'] += 1"
]
},
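{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the chain of updates concrete, the following is a minimal sketch of one Adam step on a single scalar parameter in plain Python, following the formulas above; the parameter value and the gradient value `g` are arbitrary numbers chosen only for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Minimal sketch: one Adam step on a scalar parameter, mirroring the formulas\n",
"# above. The gradient value g is arbitrary and used only for illustration.\n",
"beta1, beta2, eta, eps = 0.9, 0.999, 0.01, 1e-8\n",
"x, v, s, t, g = 1.0, 0.0, 0.0, 1, 0.5\n",
"\n",
"v = beta1 * v + (1 - beta1) * g            # EWMA of the gradient\n",
"s = beta2 * s + (1 - beta2) * g * g        # EWMA of the squared gradient\n",
"v_hat = v / (1 - beta1 ** t)               # bias correction\n",
"s_hat = s / (1 - beta2 ** t)\n",
"x -= eta * v_hat / (s_hat ** 0.5 + eps)    # update the parameter\n",
"print(x)  # roughly 0.99: x moves by about eta in the direction of -g"
]
},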
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use Adam to train the model with a learning rate of $0.01$."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "5"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"loss: 0.246945, 0.369802 sec per epoch\n"
]
},
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
"