{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Implementation of Recurrent Neural Networks from Scratch\n",
"\n",
"In this section we implement a language model from scratch. It is based on a character-level recurrent neural network trained on H. G. Wells' 'The Time Machine'. As before, we start by reading the dataset first."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "1"
}
},
"outputs": [],
"source": [
"import sys\n",
"sys.path.insert(0, '..')\n",
"\n",
"import d2l\n",
"import math\n",
"from mxnet import autograd, nd\n",
"from mxnet.gluon import loss as gloss\n",
"import time\n",
"\n",
"corpus_indices, vocab = d2l.load_data_time_machine()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## One-hot Encoding\n",
"\n",
"One-hot encoding vectors provide an easy way to express words as vectors in order to process them in a deep network. In a nutshell, we map each word to a different unit vector: assume that the number of different characters in the dictionary is $N$ (the `len(vocab)`) and each character has a one-to-one correspondence with a single value in the index of successive integers from 0 to $N-1$. If the index of a character is the integer $i$, then we create a vector $\\mathbf{e}_i$ of all 0s with a length of $N$ and set the element at position $i$ to 1. This vector is the one-hot vector of the original character. The one-hot vectors with indices 0 and 2 are shown below (the length of the vector is equal to the dictionary size)."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "2"
}
},
"outputs": [
{
"data": {
"text/plain": [
"\n",
"[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n",
" 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]\n",
" [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n",
" 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]\n",
""
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"nd.one_hot(nd.array([0, 2]), len(vocab))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that one-hot encodings are just a convenient way of separating the encoding (e.g. mapping the character `a` to $(1,0,0, \\ldots) vector)$ from the embedding (i.e. multiplying the encoded vectors by some weight matrix $\\mathbf{W}$). This simplifies the code greatly relative to storing an embedding matrix that the user needs to maintain. \n",
"\n",
"The shape of the mini-batch we sample each time is (batch size, time step). The following function transforms such mini-batches into a number of matrices with the shape of (batch size, dictionary size) that can be entered into the network. The total number of vectors is equal to the number of time steps. That is, the input of time step $t$ is $\\boldsymbol{X}_t \\in \\mathbb{R}^{n \\times d}$, where $n$ is the batch size and $d$ is the number of inputs. That is the one-hot vector length (the dictionary size)."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "3"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(5, (2, 44))"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# This function is saved in the d2l package for future use\n",
"def to_onehot(X, size): \n",
" return [nd.one_hot(x, size) for x in X.T]\n",
"\n",
"X = nd.arange(10).reshape((2, 5))\n",
"inputs = to_onehot(X, len(vocab))\n",
"len(inputs), inputs[0].shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The code above generates 5 minibatches containing 2 vectors each. Since we have a total of 43 distinct symbols in \"The Time Machine\" we get 43-dimensional vectors.\n",
"\n",
"## Initializing the Model Parameters\n",
"\n",
"Next, we initialize the model parameters. The number of hidden units `num_hiddens` is a tunable parameter."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "4"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Using gpu(0)\n"
]
}
],
"source": [
"num_inputs, num_hiddens, num_outputs = len(vocab), 512, len(vocab)\n",
"ctx = d2l.try_gpu()\n",
"print('Using', ctx)\n",
"\n",
"# Create the parameters of the model, initialize them and attach gradients\n",
"def get_params():\n",
" def _one(shape):\n",
" return nd.random.normal(scale=0.01, shape=shape, ctx=ctx)\n",
"\n",
" # Hidden layer parameters\n",
" W_xh = _one((num_inputs, num_hiddens))\n",
" W_hh = _one((num_hiddens, num_hiddens))\n",
" b_h = nd.zeros(num_hiddens, ctx=ctx)\n",
" # Output layer parameters\n",
" W_hq = _one((num_hiddens, num_outputs))\n",
" b_q = nd.zeros(num_outputs, ctx=ctx)\n",
" # Attach a gradient\n",
" params = [W_xh, W_hh, b_h, W_hq, b_q]\n",
" for param in params:\n",
" param.attach_grad()\n",
" return params"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Sequence Modeling\n",
"\n",
"### RNN Model\n",
"\n",
"We implement this model based on the definition of an RNN. First, we need an `init_rnn_state` function to return the hidden state at initialization. It returns a tuple consisting of an NDArray with a value of 0 and a shape of (batch size, number of hidden units). Using tuples makes it easier to handle situations where the hidden state contains multiple NDArrays (e.g. when combining multiple layers in an RNN where each layers requires initializing)."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "5"
}
},
"outputs": [],
"source": [
"def init_rnn_state(batch_size, num_hiddens, ctx):\n",
" return (nd.zeros(shape=(batch_size, num_hiddens), ctx=ctx), )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following `rnn` function defines how to compute the hidden state and output in a time step. The activation function here uses the tanh function. As described in the [\"Multilayer Perceptron\"](../chapter_deep-learning-basics/mlp.md) section, the mean value of the $\\tanh$ function values is 0 when the elements are evenly distributed over the real numbers."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "6"
}
},
"outputs": [],
"source": [
"def rnn(inputs, state, params):\n",
" # Both inputs and outputs are composed of num_steps matrices of the shape\n",
" # (batch_size, len(vocab))\n",
" W_xh, W_hh, b_h, W_hq, b_q = params\n",
" H, = state\n",
" outputs = []\n",
" for X in inputs:\n",
" H = nd.tanh(nd.dot(X, W_xh) + nd.dot(H, W_hh) + b_h)\n",
" Y = nd.dot(H, W_hq) + b_q\n",
" outputs.append(Y)\n",
" return outputs, (H,)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's run a simple test to check whether the model makes any sense at all. In particular, let's check whether inputs and outputs have the correct dimensions, e.g. to ensure that the dimensionality of the hidden state hasn't changed."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "7"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(5, (2, 44), (2, 512))"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"state = init_rnn_state(X.shape[0], num_hiddens, ctx)\n",
"inputs = to_onehot(X.as_in_context(ctx), len(vocab))\n",
"params = get_params()\n",
"outputs, state_new = rnn(inputs, state, params)\n",
"len(outputs), outputs[0].shape, state_new[0].shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Prediction Function\n",
"\n",
"The following function predicts the next `num_chars` characters based on the `prefix` (a string containing several characters). This function is a bit more complicated. Whenever the actual sequence is known, i.e. for the beginning of the sequence, we only update the hidden state. After that we begin generating new characters and emitting them. For convenience we use the recurrent neural unit `rnn` as a function parameter, so that this function can be reused in the other recurrent neural networks described in following sections."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "8"
}
},
"outputs": [],
"source": [
"# This function is saved in the d2l package for future use\n",
"def predict_rnn(prefix, num_chars, rnn, params, init_rnn_state,\n",
" num_hiddens, vocab, ctx):\n",
" state = init_rnn_state(1, num_hiddens, ctx)\n",
" output = [vocab[prefix[0]]]\n",
" for t in range(num_chars + len(prefix) - 1):\n",
" # The output of the previous time step is taken as the input of the\n",
" # current time step.\n",
" X = to_onehot(nd.array([output[-1]], ctx=ctx), len(vocab))\n",
" # Calculate the output and update the hidden state\n",
" (Y, state) = rnn(X, state, params)\n",
" # The input to the next time step is the character in the prefix or\n",
" # the current best predicted character\n",
" if t < len(prefix) - 1:\n",
" # Read off from the given sequence of characters\n",
" output.append(vocab[prefix[t + 1]])\n",
" else:\n",
" # This is maximum likelihood decoding. Modify this if you want\n",
" # use sampling, beam search or beam sampling for better sequences.\n",
" output.append(int(Y[0].argmax(axis=1).asscalar()))\n",
" return ''.join([vocab.idx_to_token[i] for i in output])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We test the `predict_rnn` function first. Given that we didn't train the network it will generate nonsensical predictions. We initialize it with the sequence `traveller ` and have it generate 10 additional characters."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "9"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'traveller bexhxhxhxh'"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"predict_rnn('traveller ', 10, rnn, params, init_rnn_state, num_hiddens, \n",
" vocab, ctx)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Gradient Clipping\n",
"\n",
"When solving an optimization problem we take update steps for the\n",
"weights $\\mathbf{w}$ in the general direction of the negative gradient\n",
"$\\mathbf{g}_t$ on a minibatch, say $\\mathbf{w} - \\eta \\cdot \\mathbf{g}_t$. Let's further assume that the objective is well behaved, i.e. it is Lipschitz continuous with constant $L$, i.e. \n",
"\n",
"$$|l(\\mathbf{w}) - l(\\mathbf{w}')| \\leq L \\|\\mathbf{w} - \\mathbf{w}'\\|.$$\n",
"\n",
"In this case we can safely assume that if we update the weight vector by $\\eta \\cdot \\mathbf{g}_t$ we will not observe a change by more than $L \\eta \\|\\mathbf{g}_t\\|$. This is both a curse and a blessing. A curse since it limits the speed with which we can make progress, a blessing since it limits the extent to which things can go wrong if we move in the wrong direction. \n",
"\n",
"Sometimes the gradients can be quite large and the optimization algorithm may fail to converge. We could address this by reducing the learning rate $\\eta$ or by some other higher order trick. But what if we only rarely get large gradients? In this case such an approach may appear entirely unwarranted. One alternative is to clip the gradients by projecting them back to a ball of a given radius, say $\\theta$ via \n",
"\n",
"$$\\mathbf{g} \\leftarrow \\min\\left(1, \\frac{\\theta}{\\|\\mathbf{g}\\|}\\right) \\mathbf{g}.$$\n",
"\n",
"By doing so we know that the gradient norm never exceeds $\\theta$ and that the updated gradient is entirely aligned with the original direction $\\mathbf{g}$. It also has the desirable side-effect of limiting the influence any given minibatch (and within it any given sample) can exert on the weight vectors. This bestows a certain degree of robustness to the model. Back to the case at hand - optimization in RNNs. One of the issues is that the gradients in an RNN may either explode or vanish. Consider the chain of matrix-products involved in backpropagation. If the largest eigenvalue of the matrices is typically larger than $1$, then the product of many such matrices can be much larger than $1$. As a result, the aggregate gradient might explode. Gradient clipping provides a quick fix. While it doesn't entire solve the problem, it is one of the many techniques to alleviate it."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "10"
}
},
"outputs": [],
"source": [
"# This function is saved in the d2l package for future use\n",
"def grad_clipping(params, theta, ctx):\n",
" norm = nd.array([0], ctx)\n",
" for param in params:\n",
" norm += (param.grad ** 2).sum()\n",
" norm = norm.sqrt().asscalar()\n",
" if norm > theta:\n",
" for param in params:\n",
" param.grad[:] *= theta / norm"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Perplexity\n",
"\n",
"One way of measuring how well a sequence model works is to check how surprising the text is. A good language model is able to predict with high accuracy what we will see next. Consider the following continuations of the phrase `It is raining`, as proposed by different language models:\n",
"\n",
"1. `It is raining outside`\n",
"1. `It is raining banana tree`\n",
"1. `It is raining piouw;kcj pwepoiut`\n",
"\n",
"In terms of quality, example 1 is clearly the best. The words are sensible and logically coherent. While it might not quite so accurately reflect which word follows (`in San Francisco` and `in winter` would have been perfectly reasonable extensions), the model is able to capture which kind of word follows. Example 2 is considerably worse by producing a nonsensical and borderline dysgrammatical extension. Nonetheless, at least the model has learned how to spell words and some degree of correlation between words. Lastly, example 3 indicates a poorly trained model that doesn't fit data. \n",
"\n",
"One way of measuring the quality of the model is to compute $p(w)$, i.e. the likelihood of the sequence. Unfortunately this is a number that is hard to understand and difficult to compare. After all, shorter sequences are *much* more likely than long ones, hence evaluating the model on Tolstoy's magnum opus ['War and Peace'](https://www.gutenberg.org/files/2600/2600-h/2600-h.htm) will inevitably produce a much smaller likelihood than, say, on Saint-Exupery's novella ['The Little Prince'](https://en.wikipedia.org/wiki/The_Little_Prince). What is missing is the equivalent of an average. \n",
"\n",
"Information Theory comes handy here. If we want to compress text we can ask about estimating the next symbol given the current set of symbols. A lower bound on the number of bits is given by $-\\log_2 p(w_t|w_{t-1}, \\ldots w_1)$. A good language model should allow us to predict the next word quite accurately and thus it should allow us to spend very few bits on compressing the sequence. One way of measuring it is by the average number of bits that we need to spend.\n",
"\n",
"$$\\frac{1}{n} \\sum_{t=1}^n -\\log p(w_t|w_{t-1}, \\ldots w_1) = \\frac{1}{|w|} -\\log p(w)$$\n",
"\n",
"This makes the performance on documents of different lengths comparable. For historical reasons scientists in natural language processing prefer to use a quantity called perplexity rather than bitrate. In a nutshell it is the exponential of the above:\n",
"\n",
"$$\\mathrm{PPL} := \\exp\\left(-\\frac{1}{n} \\sum_{t=1}^n \\log p(w_t|w_{t-1}, \\ldots w_1)\\right)$$\n",
"\n",
"It can be best understood as the harmonic mean of the number of real choices that we have when deciding which word to pick next. Note that Perplexity naturally generalizes the notion of the cross entropy loss defined when we introduced [Softmax Regression](../chapter_deep-learning-basics/softmax-regression.md). That is, for a single symbol both definitions are identical bar the fact that one is the exponential of the other. Let's look at a number of cases:\n",
"\n",
"* In the best case scenario, the model always estimates the probability of the next symbol as $1$. In this case the perplexity of the model is $1$.\n",
"* In the worst case scenario, the model always predicts the probability of the label category as 0. In this situation, the perplexity is infinite.\n",
"* At the baseline, the model predicts a uniform distribution over all tokens. In this case the perplexity equals the size of the dictionary `len(vocab)`. In fact, if we were to store the sequence without any compression this would be the best we could do to encode it. Hence this provides a nontrivial upper bound that any model must satisfy. \n",
"\n",
"## Training the Model\n",
"\n",
"Training a sequence model proceeds quite different from previous codes. In particular we need to take care of the following changes due to the fact that the tokens appear in order:\n",
"\n",
"1. We use perplexity to evaluate the model. This ensures that different tests are comparable. \n",
"1. We clip the gradient before updating the model parameters. This ensures that the model doesn't diverge even when gradients blow up at some point during the training process (effectively it reduces the stepsize automatically).\n",
"3. Different sampling methods for sequential data (independent sampling and sequential partitioning) will result in differences in the initialization of hidden states. We discussed these issues in detail when we covered [data processing](lang-model-dataset.md).\n",
"\n",
"### Optimization Loop"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "11"
}
},
"outputs": [],
"source": [
"# This function is saved in the d2l package for future use\n",
"def train_and_predict_rnn(rnn, get_params, init_rnn_state, num_hiddens,\n",
" corpus_indices, vocab, ctx, is_random_iter, \n",
" num_epochs, num_steps, lr, clipping_theta, \n",
" batch_size, prefixes):\n",
" if is_random_iter:\n",
" data_iter_fn = d2l.data_iter_random\n",
" else:\n",
" data_iter_fn = d2l.data_iter_consecutive\n",
" params = get_params()\n",
" loss = gloss.SoftmaxCrossEntropyLoss()\n",
" start = time.time()\n",
" for epoch in range(num_epochs):\n",
" if not is_random_iter: \n",
" # If adjacent sampling is used, the hidden state is initialized \n",
" # at the beginning of the epoch\n",
" state = init_rnn_state(batch_size, num_hiddens, ctx)\n",
" l_sum, n = 0.0, 0\n",
" data_iter = data_iter_fn(corpus_indices, batch_size, num_steps, ctx)\n",
" for X, Y in data_iter:\n",
" if is_random_iter: \n",
" # If random sampling is used, the hidden state is initialized \n",
" # before each mini-batch update\n",
" state = init_rnn_state(batch_size, num_hiddens, ctx)\n",
" else: \n",
" # Otherwise, the detach function needs to be used to separate \n",
" # the hidden state from the computational graph to avoid \n",
" # backpropagation beyond the current sample\n",
" for s in state:\n",
" s.detach()\n",
" with autograd.record():\n",
" inputs = to_onehot(X, len(vocab))\n",
" # outputs is num_steps terms of shape (batch_size, len(vocab))\n",
" (outputs, state) = rnn(inputs, state, params)\n",
" # After stitching it is (num_steps * batch_size, len(vocab))\n",
" outputs = nd.concat(*outputs, dim=0)\n",
" # The shape of Y is (batch_size, num_steps), and then becomes \n",
" # a vector with a length of batch * num_steps after \n",
" # transposition. This gives it a one-to-one correspondence \n",
" # with output rows\n",
" y = Y.T.reshape((-1,))\n",
" # Average classification error via cross entropy loss\n",
" l = loss(outputs, y).mean()\n",
" l.backward()\n",
" grad_clipping(params, clipping_theta, ctx) # Clip the gradient\n",
" d2l.sgd(params, lr, 1) \n",
" # Since the error is the mean, no need to average gradients here\n",
" l_sum += l.asscalar() * y.size\n",
" n += y.size\n",
" if (epoch + 1) % 50 == 0:\n",
" print('epoch %d, perplexity %f, time %.2f sec' % (\n",
" epoch + 1, math.exp(l_sum / n), time.time() - start))\n",
" start = time.time()\n",
" if (epoch + 1) % 100 == 0:\n",
" for prefix in prefixes:\n",
" print(' -', predict_rnn(prefix, 50, rnn, params, \n",
" init_rnn_state, num_hiddens,\n",
" vocab, ctx))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Experiments with a Sequence Model\n",
"\n",
"Now we can train the model. First, we need to set the model hyper-parameters. To allow for some meaningful amount of context we set the sequence length to 64. In particular, we will see how training using the 'separate' and 'sequential' term generation will affect the performance of the model."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "12"
}
},
"outputs": [],
"source": [
"num_epochs, num_steps, batch_size, lr, clipping_theta = 500, 64, 32, 1, 1\n",
"prefixes = ['traveller', 'time traveller']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's use random sampling to train the model and produce some text."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "13"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch 50, perplexity 10.978435, time 9.63 sec\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch 100, perplexity 9.014272, time 9.56 sec\n",
" - traveller and the the the the the the the the the the the t\n",
" - time traveller and the the the the the the the the the the the t\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch 150, perplexity 7.819009, time 9.58 sec\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch 200, perplexity 6.815296, time 9.52 sec\n",
" - traveller so the g the t me the the the ghis the timention \n",
" - time traveller so he the the ghis the timention dime tion the th\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch 250, perplexity 5.853415, time 9.58 sec\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch 300, perplexity 4.593592, time 9.53 sec\n",
" - traveller. 'it lly thit s abe as ans and to at fict. 'ic to\n",
" - time traveller the time traveller the time traveller the time tr\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch 350, perplexity 2.979017, time 9.66 sec\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch 400, perplexity 2.069119, time 9.62 sec\n",
" - traveller smiled lound mave aterlall the ertime sac mather \n",
" - time traveller smiled lound mave aterlall the ertime sac mather \n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch 450, perplexity 1.625378, time 9.68 sec\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch 500, perplexity 1.398721, time 9.55 sec\n",
" - traveller. 'but now you begin to see the object of my inves\n",
" - time traveller held in his hand was a glittering metallic framew\n"
]
}
],
"source": [
"train_and_predict_rnn(rnn, get_params, init_rnn_state, num_hiddens, \n",
" corpus_indices, vocab, ctx, True, num_epochs, \n",
" num_steps, lr, clipping_theta, batch_size, prefixes)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Even though our model was rather primitive, it is nonetheless able to produce text that resembles language. Now let's compare this with sequential partitioning."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "19"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch 50, perplexity 11.445038, time 9.50 sec\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch 100, perplexity 8.935954, time 9.51 sec\n",
" - travellere the the the the the the the the the the the the \n",
" - time travellere the the the the the the the the the the the the \n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch 150, perplexity 7.551598, time 9.54 sec\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch 200, perplexity 6.506595, time 9.47 sec\n",
" - traveller, and the time time sime sime sime sime sime sime \n",
" - time traveller. 'the time trace lar all in thi g an thing the ti\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch 250, perplexity 5.222512, time 9.54 sec\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch 300, perplexity 3.131410, time 9.60 sec\n",
" - traveller aflere the that us sone in time. 'y all have are \n",
" - time traveller cald in the exment anglow, the time traveller cal\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch 350, perplexity 1.915635, time 9.66 sec\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch 400, perplexity 1.456524, time 9.61 sec\n",
" - traveller, wathed thon s are mere arof hithrace fas of stew\n",
" - time traveller held in his hand was a glitter ur mevest anoll on\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch 450, perplexity 1.092744, time 9.59 sec\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"epoch 500, perplexity 1.088562, time 9.52 sec\n",
" - traveller smiled. 'are you sure we can move freely in space\n",
" - time traveller smiled round at us. then, you would a tometha is \n"
]
}
],
"source": [
"train_and_predict_rnn(rnn, get_params, init_rnn_state, num_hiddens, \n",
" corpus_indices, vocab, ctx, False, num_epochs, \n",
" num_steps, lr, clipping_theta, batch_size, prefixes)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the following we will see how to improve significantly on the current model and how to make it faster and easier to implement. \n",
"\n",
"## Summary\n",
"\n",
"* Sequence models need state initialization for training.\n",
"* Between sequential models you need to ensure to detach the gradient, to ensure that the automatic differentiation does not propagate effects beyond the current sample.\n",
"* A simple RNN language model consists of an encoder, an RNN model and a decoder. \n",
"* Gradient clipping prevents gradient explosion (but it cannot fix vanishing gradients).\n",
"* Perplexity calibrates model performance across variable sequence length. It is the exponentiated average of the cross-entropy loss.\n",
"* Sequential partitioning typically leads to better models. \n",
"\n",
"## Exercises\n",
"\n",
"1. Show that one-hot encoding is equivalent to picking a different embedding for each object.\n",
"1. Adjust the hyperparameters to improve the perplexity. \n",
" * How low can you go? Adjust embeddings, hidden units, learning rate, etc.\n",
" * How well will it work on other books by H. G. Wells, e.g. [The War of the Worlds](http://www.gutenberg.org/ebooks/36).\n",
"1. Run the code in this section without clipping the gradient. What happens?\n",
"1. Set the `pred_period` variable to 1 to observe how the under-trained model (high perplexity) writes lyrics. What can you learn from this?\n",
"1. Change adjacent sampling so that it does not separate hidden states from the computational graph. Does the running time change? How about the accuracy?\n",
"1. Replace the activation function used in this section with ReLU and repeat the experiments in this section.\n",
"1. Prove that the perplexity is the inverse of the harmonic mean of the conditional word probabilities. \n",
"\n",
"## Scan the QR Code to [Discuss](https://discuss.mxnet.io/t/2364)\n",
"\n",
"![](../img/qr_rnn-scratch.svg)"
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}