# 8.3. Recurrent Neural Networks

In the previous section we introduced \(n\)-gram models, in which the conditional probability of word \(w_t\) at position \(t\) depends only on the \(n-1\) preceding words. If we want to capture the possible effect of words earlier than \(t-(n-1)\) on \(w_t\), we need to increase \(n\). However, the number of model parameters then grows exponentially with \(n\), since we need to store \(|V|^n\) numbers for a vocabulary \(V\). Hence, rather than modeling \(p(w_t \mid w_{t-1}, \ldots, w_{t-n+1})\) it is preferable to use a latent variable model in which we have

\[p(w_t \mid w_{t-1}, \ldots, w_1) \approx p(w_t \mid h_t(w_{t-1}, \ldots, w_1)).\]

For a sufficiently powerful function \(h_t\) this is not an approximation. After all, \(h_t\) could simply store all the data it has observed so far. We discussed this in Section 8.1. Let’s see why building such models is a bit trickier than building simple autoregressive models where

\[p(w_t \mid w_{t-1}, \ldots, w_1) \approx p(w_t \mid w_{t-1}, \ldots, w_{t-n+1}).\]

As a warm-up we will review the latter for discrete outputs and \(n=2\), i.e., for a first-order Markov model. To simplify things further, we use a single layer in the design of the RNN. Later on we will see how to add more expressivity efficiently across items.
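The latent-variable recurrence above can be sketched as a single-layer update \(h_t = f(w_t, h_{t-1})\). The following is a minimal NumPy sketch, not the book's implementation; the weight names (`W_xh`, `W_hh`, `b_h`), sizes, and the `tanh` activation are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, num_hiddens = 5, 4
W_xh = rng.normal(scale=0.01, size=(vocab_size, num_hiddens))  # input-to-hidden
W_hh = rng.normal(scale=0.01, size=(num_hiddens, num_hiddens))  # hidden-to-hidden
b_h = np.zeros(num_hiddens)

def step(x_onehot, h):
    """One recurrent update: the new state mixes the current input with the old state."""
    return np.tanh(x_onehot @ W_xh + h @ W_hh + b_h)

h = np.zeros(num_hiddens)        # initial latent state
for t in [0, 2, 1]:              # a toy sequence of token indices
    x = np.eye(vocab_size)[t]    # one-hot encoding of the current token
    h = step(x, h)               # h now summarizes all tokens seen so far
print(h.shape)                   # (4,)
```

Note that the same `step` function is reused at every position, so the state `h` has a fixed size no matter how long the sequence grows.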

## 8.3.3. Steps in a Language Model

We conclude this section by illustrating how RNNs can be used to build a language model. For simplicity of illustration we use words rather than characters, since the former are easier to comprehend. Let the mini-batch size be 1, and let the sequence of text be the beginning of our dataset, i.e., “the time machine by h. g. wells”. The figure below illustrates how to estimate the next word based on the current and previous words. During training, we run a softmax operation on the output of the output layer at each time step, and then use the cross-entropy loss function to compute the error between the prediction and the label. Due to the recurrent computation of the hidden state in the hidden layer, the output \(\mathbf{O}_3\) at time step 3 is determined by the text sequence “the”, “time”, “machine”. Since the next word of the sequence in the training data is “by”, the loss at time step 3 depends on the probability distribution over the next word generated from the sequence “the”, “time”, “machine” and on the label “by” of this time step.
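The per-step softmax and cross-entropy computation described above can be sketched as follows. The logits and the tiny four-word vocabulary here are made-up toy values, not outputs of an actual model:

```python
import numpy as np

# Toy setup: a four-word vocabulary and made-up output-layer scores for
# time step 3, i.e. after the model has read "the", "time", "machine".
vocab = ["the", "time", "machine", "by"]
logits_t3 = np.array([0.2, -1.0, 0.5, 2.0])  # illustrative scores O_3
label = vocab.index("by")                    # the next word in the training text

# Softmax turns the scores into a probability distribution over the vocabulary
# (subtracting the max is a standard trick for numerical stability).
probs = np.exp(logits_t3 - logits_t3.max())
probs /= probs.sum()

# Cross-entropy for this time step: negative log-probability of the label.
loss_t3 = -np.log(probs[label])
print(float(loss_t3))
```

Summing (or averaging) this quantity over all time steps of the sequence gives the training loss that is backpropagated through the recurrent computation.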

The number of words is huge compared to the number of characters, which is why we will often (as in the subsequent sections) use a character-level RNN instead. In the next few sections, we will introduce its implementation.

## 8.3.4. Summary

- A network that uses recurrent computation is called a recurrent neural network (RNN).
- The hidden state of the RNN can capture historical information of the sequence up to the current time step.
- The number of RNN model parameters does not grow as the number of time steps increases.
- We can create language models using a character-level RNN.
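The fixed parameter count can be made concrete: a single-hidden-layer RNN needs only the input-to-hidden, hidden-to-hidden, and hidden-to-output weights plus two bias vectors, regardless of sequence length. The sizes below (a 28-symbol vocabulary, 512 hidden units) are illustrative, not taken from the book:

```python
def rnn_num_params(vocab_size, num_hiddens):
    """Parameter count of a single-layer RNN with one-hot inputs."""
    input_to_hidden = vocab_size * num_hiddens    # W_xh
    hidden_to_hidden = num_hiddens * num_hiddens  # W_hh
    hidden_to_output = num_hiddens * vocab_size   # W_hq
    biases = num_hiddens + vocab_size             # b_h, b_q
    return input_to_hidden + hidden_to_hidden + hidden_to_output + biases

# The count is the same whether we unroll for 10 time steps or 10,000.
print(rnn_num_params(28, 512))  # 291356
```

Contrast this with an \(n\)-gram model, whose \(|V|^n\) table grows exponentially as we extend the context window.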

## 8.3.5. Exercises

1. If we use an RNN to predict the next character in a text sequence, how many output dimensions do we need?
2. Can you design a mapping for which an RNN with hidden states is exact? Hint: what about a finite number of words?
3. What happens to the gradient if you backpropagate through a long sequence?
4. What are some of the problems associated with the simple sequence model described above?