10.3. Deep Recurrent Neural Networks
Up until now, we have focused on defining networks consisting of a
sequence input, a single hidden RNN layer, and an output layer. Despite
having just one hidden layer between the input at any time step and the
corresponding output, there is a sense in which these networks are deep:
inputs from the first time step can influence the outputs at the final
time step $T$, passing through $T$ applications of the recurrent layer
along the way. However, we often also wish to express complex
relationships between the inputs and outputs at a given time step,
which calls for RNNs that are deep not only in the time direction but
also in the input-to-output direction, with several recurrent layers
stacked between input and output.
The standard method for building this sort of deep RNN is strikingly
simple: we stack the RNNs on top of each other. Given a sequence of
length $T$, the first RNN produces a sequence of outputs, also of
length $T$; these, in turn, constitute the inputs to the next RNN
layer. Fig. 10.3.1 illustrates a deep RNN with $L$ hidden layers: each
hidden state is passed both to the next time step of the current layer
and to the current time step of the next layer.
Fig. 10.3.1 Architecture of a deep RNN.
Formally, suppose that we have a minibatch input $\mathbf{X}_t \in \mathbb{R}^{n \times d}$ (number of examples: $n$; number of inputs in each example: $d$) at time step $t$. At the same time step, let the hidden state of the $l^\textrm{th}$ hidden layer ($l = 1, \ldots, L$) be $\mathbf{H}_t^{(l)} \in \mathbb{R}^{n \times h}$ (number of hidden units: $h$) and the output layer variable be $\mathbf{O}_t \in \mathbb{R}^{n \times q}$ (number of outputs: $q$). Setting $\mathbf{H}_t^{(0)} = \mathbf{X}_t$, the hidden state of the $l^\textrm{th}$ hidden layer, which uses the activation function $\phi_l$, is calculated as follows:

$$\mathbf{H}_t^{(l)} = \phi_l\left(\mathbf{H}_t^{(l-1)} \mathbf{W}_{\textrm{xh}}^{(l)} + \mathbf{H}_{t-1}^{(l)} \mathbf{W}_{\textrm{hh}}^{(l)} + \mathbf{b}_\textrm{h}^{(l)}\right),$$

where the weights $\mathbf{W}_{\textrm{xh}}^{(l)}$ (of shape $d \times h$ for the first hidden layer and $h \times h$ thereafter) and $\mathbf{W}_{\textrm{hh}}^{(l)} \in \mathbb{R}^{h \times h}$, together with the bias $\mathbf{b}_\textrm{h}^{(l)} \in \mathbb{R}^{1 \times h}$, are the model parameters of the $l^\textrm{th}$ hidden layer.

At the end, the calculation of the output layer is based only on the
hidden state of the final $L^\textrm{th}$ hidden layer:

$$\mathbf{O}_t = \mathbf{H}_t^{(L)} \mathbf{W}_{\textrm{hq}} + \mathbf{b}_\textrm{q},$$

where the weight $\mathbf{W}_{\textrm{hq}} \in \mathbb{R}^{h \times q}$ and the bias $\mathbf{b}_\textrm{q} \in \mathbb{R}^{1 \times q}$ are the model parameters of the output layer.

Just as with MLPs, the number of hidden layers $L$ and the number of hidden units $h$ are hyperparameters that we can tune. In addition, we can easily obtain a deep gated RNN by replacing the hidden state computation above with that of a GRU or an LSTM.
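To make the shapes concrete, here is a minimal worked sketch in plain PyTorch that computes a single time step of the two-layer recurrence above. The dimension values $n$, $d$, $h$, $q$ and the number of layers $L$ are hypothetical and chosen only for illustration; none of the names below come from the d2l code.

import torch

n, d, h, q = 4, 8, 32, 10   # batch size, input dim, hidden units, outputs (hypothetical)
L = 2                        # number of hidden layers
phi = torch.tanh             # activation function used by every layer

# Parameters of each hidden layer l: W_xh, W_hh, b_h
W_xh = [torch.randn(d if l == 0 else h, h) * 0.01 for l in range(L)]
W_hh = [torch.randn(h, h) * 0.01 for l in range(L)]
b_h = [torch.zeros(h) for l in range(L)]

X_t = torch.randn(n, d)                          # minibatch input at time step t
H_prev = [torch.zeros(n, h) for l in range(L)]   # hidden states from time step t-1

# H_t^(l) = phi(H_t^(l-1) W_xh^(l) + H_{t-1}^(l) W_hh^(l) + b_h^(l)), with H_t^(0) = X_t
H_t = []
inp = X_t
for l in range(L):
    H = phi(inp @ W_xh[l] + H_prev[l] @ W_hh[l] + b_h[l])
    H_t.append(H)
    inp = H                                      # layer l's output feeds layer l+1

# The output layer uses only the topmost hidden state
W_hq, b_q = torch.randn(h, q) * 0.01, torch.zeros(q)
O_t = H_t[-1] @ W_hq + b_q
print(O_t.shape)                                 # torch.Size([4, 10])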
10.3.1. Implementation from Scratch
To implement a multilayer RNN from scratch, we can treat each layer as
an RNNScratch instance with its own learnable parameters. The
multilayer forward computation simply performs forward computation
layer by layer, as sketched below.
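As a rough sketch (not necessarily identical to the saved d2l implementation), a StackedRNNScratch module can wrap a list of RNNScratch layers and loop over them in its forward pass. The PyTorch code below assumes the RNNScratch class from Section 9.5 with the signature RNNScratch(num_inputs, num_hiddens, sigma), whose forward method returns a list of per-step outputs together with the final hidden state.

import torch
from torch import nn
from d2l import torch as d2l

class StackedRNNScratch(d2l.Module):
    """A stack of scratch RNN layers (sketch); layer i feeds layer i+1."""
    def __init__(self, num_inputs, num_hiddens, num_layers, sigma=0.01):
        super().__init__()
        self.save_hyperparameters()
        # The first layer maps from the input dimension; later layers map
        # from num_hiddens to num_hiddens.
        self.rnns = nn.Sequential(*[d2l.RNNScratch(
            num_inputs if i == 0 else num_hiddens, num_hiddens, sigma)
            for i in range(num_layers)])

    def forward(self, inputs, Hs=None):
        outputs = inputs
        if Hs is None:
            Hs = [None] * self.num_layers
        for i in range(self.num_layers):
            # Assumption: each scratch RNN returns (per-step output list, state).
            outputs, Hs[i] = self.rnns[i](outputs, Hs[i])
            outputs = torch.stack(outputs, 0)  # (num_steps, batch, num_hiddens)
        return outputs, Hs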
As an example, we train a deep RNN language model on The Time Machine dataset (same as in Section 9.5). To keep things simple, we set the number of layers to 2.
data = d2l.TimeMachine(batch_size=1024, num_steps=32)
rnn_block = StackedRNNScratch(num_inputs=len(data.vocab),
                              num_hiddens=32, num_layers=2)
model = d2l.RNNLMScratch(rnn_block, vocab_size=len(data.vocab), lr=2)
trainer = d2l.Trainer(max_epochs=100, gradient_clip_val=1, num_gpus=1)
trainer.fit(model, data)
10.3.2. Concise Implementation
Fortunately many of the logistical details required to implement multiple layers of an RNN are readily available in high-level APIs. Our concise implementation will use such built-in functionalities. The code generalizes the one we used previously in Section 10.2, letting us specify the number of layers explicitly rather than picking the default of only one layer.
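With PyTorch, for instance, the whole multilayer model reduces to a thin wrapper around nn.GRU, which accepts num_layers and dropout arguments directly. The sketch below assumes the d2l.RNN wrapper from Section 10.2, whose forward pass simply calls the wrapped recurrent layer.

from torch import nn
from d2l import torch as d2l

class GRU(d2l.RNN):
    """The multilayer GRU model (a sketch built on nn.GRU)."""
    def __init__(self, num_inputs, num_hiddens, num_layers, dropout=0):
        d2l.Module.__init__(self)
        self.save_hyperparameters()
        # nn.GRU stacks num_layers GRU layers; dropout is applied to the
        # outputs of every layer except the last.
        self.rnn = nn.GRU(num_inputs, num_hiddens, num_layers,
                          dropout=dropout)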
Flax takes a minimalistic approach to implementing RNNs: defining the
number of layers in an RNN or combining it with dropout is not
available out of the box. Our concise implementation will use the
built-in functionality and add num_layers and dropout features on top.
The code generalizes the one we used previously in Section 10.2,
allowing us to specify the number of layers explicitly rather than
picking the default of a single layer.
import jax
from flax import linen as nn
from jax import numpy as jnp
from d2l import jax as d2l

class GRU(d2l.RNN):  #@save
    """The multilayer GRU model."""
    num_hiddens: int
    num_layers: int
    dropout: float = 0

    @nn.compact
    def __call__(self, X, state=None, training=False):
        outputs = X
        new_state = []
        if state is None:
            batch_size = X.shape[1]
            state = [nn.GRUCell.initialize_carry(jax.random.PRNGKey(0),
                (batch_size,), self.num_hiddens)] * self.num_layers

        GRU = nn.scan(nn.GRUCell, variable_broadcast="params",
                      in_axes=0, out_axes=0, split_rngs={"params": False})

        # Introduce a dropout layer after every GRU layer except the last;
        # each layer's (dropped-out) outputs become the next layer's inputs
        for i in range(self.num_layers - 1):
            layer_i_state, outputs = GRU()(state[i], outputs)
            new_state.append(layer_i_state)
            outputs = nn.Dropout(self.dropout,
                                 deterministic=not training)(outputs)

        # Final GRU layer without dropout
        out_state, X = GRU()(state[-1], outputs)
        new_state.append(out_state)
        return X, jnp.array(new_state)
The TensorFlow implementation follows the same pattern: we stack GRUCell instances inside a tf.keras.layers.RNN wrapper, which handles the layer-by-layer and step-by-step bookkeeping for us.
import tensorflow as tf
from d2l import tensorflow as d2l

class GRU(d2l.RNN):  #@save
    """The multilayer GRU model."""
    def __init__(self, num_hiddens, num_layers, dropout=0):
        d2l.Module.__init__(self)
        self.save_hyperparameters()
        # A list of cells passed to tf.keras.layers.RNN is stacked into a
        # multilayer RNN; dropout is applied to the inputs of each cell.
        gru_cells = [tf.keras.layers.GRUCell(num_hiddens, dropout=dropout)
                     for _ in range(num_layers)]
        self.rnn = tf.keras.layers.RNN(gru_cells, return_sequences=True,
                                       return_state=True, time_major=True)

    def forward(self, X, state=None):
        outputs, *state = self.rnn(X, state)
        return outputs, state
The architectural decisions, such as the choice of hyperparameters, are very similar to those of Section 10.2. We pick the same number of inputs and outputs as we have distinct tokens, i.e., vocab_size. The number of hidden units is still 32. The only difference is that we now select a nontrivial number of hidden layers by specifying the value of num_layers.
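Putting the pieces together, a training run might look like the following sketch (PyTorch flavor). It assumes the GRU wrapper sketched above, the d2l.RNNLM language model and d2l.Trainer used in earlier sections, and that model.predict takes a prefix, the number of characters to generate, the vocabulary, and a device, as in Section 9.5. The quoted strings that follow are sample predictions of this kind for the prefix 'it has'.

# Sketch of end-to-end training; class and helper names follow the
# d2l conventions used earlier in the book.
data = d2l.TimeMachine(batch_size=1024, num_steps=32)
gru = GRU(num_inputs=len(data.vocab), num_hiddens=32, num_layers=2)
model = d2l.RNNLM(gru, vocab_size=len(data.vocab), lr=2)
trainer = d2l.Trainer(max_epochs=100, gradient_clip_val=1, num_gpus=1)
trainer.fit(model, data)
# Generate 20 characters after warming up on the prefix 'it has'.
model.predict('it has', 20, data.vocab, d2l.try_gpu())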
'it has for and the time th'
'it has is the prough said '
10.3.3. Summary
In deep RNNs, the hidden state information is passed to the next time step of the current layer and to the current time step of the next layer. There exist many different flavors of deep RNNs, such as LSTMs, GRUs, or vanilla RNNs. Conveniently, these models are all available as part of the high-level APIs of deep learning frameworks. Initialization of models requires care. Overall, deep RNNs require a considerable amount of work (such as tuning the learning rate and clipping gradients) to ensure proper convergence.
10.3.4. Exercises
1. Replace the GRU by an LSTM and compare the accuracy and training speed.
2. Increase the training data to include multiple books. How low can you go on the perplexity scale?
3. Would you want to combine sources of different authors when modeling text? Why is this a good idea? What could go wrong?