Recurrent Neural Network Implementation from Scratch

https://d2l.ai/chapter_recurrent-neural-networks/rnn-scratch.html

It seems that sequential partitioning is used in this implementation for minibatch generation, so in exercise 2.2, maybe it should be ‘replace sequential partitioning with random sampling’?

Also, at the end, when we test training with random sampling, wouldn’t it be more reasonable to call d2l.load_data_time_machine(batch_size, num_steps, use_random_iter=True)?
With random sampling, consecutive minibatches are not adjacent in the original sequence, so the hidden state should be reinitialized for every minibatch. With sequential partitioning, we should keep carrying over the previous states because the data is continuous.
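
To make the difference concrete, here is a small pure-Python sketch of where each row of each minibatch starts under the two schemes. The helper names are hypothetical and the logic is simplified from what the d2l iterators do (for example, it ignores any random starting offset), so treat it as an illustration rather than the book's implementation.

```python
import random


def sequential_partition_starts(num_tokens, batch_size, num_steps):
    """Start offsets per minibatch when each row walks through its own
    contiguous slice of the corpus (sequential partitioning)."""
    # One token is reserved because labels are the inputs shifted by one.
    tokens_per_row = (num_tokens - 1) // batch_size
    num_batches = tokens_per_row // num_steps
    return [
        [row * tokens_per_row + b * num_steps for row in range(batch_size)]
        for b in range(num_batches)
    ]


def random_sampling_starts(num_tokens, batch_size, num_steps, seed=0):
    """Start offsets per minibatch when subsequences are shuffled first
    (random sampling)."""
    rng = random.Random(seed)
    num_subseqs = (num_tokens - 1) // num_steps
    starts = [i * num_steps for i in range(num_subseqs)]
    rng.shuffle(starts)
    num_batches = num_subseqs // batch_size
    return [
        starts[b * batch_size:(b + 1) * batch_size]
        for b in range(num_batches)
    ]


seq_batches = sequential_partition_starts(35, batch_size=2, num_steps=5)
rand_batches = random_sampling_starts(35, batch_size=2, num_steps=5)
print(seq_batches)   # [[0, 17], [5, 22], [10, 27]]
print(rand_batches)  # order depends on the shuffle
```

In the sequential case, row i of minibatch b+1 starts exactly num_steps after row i of minibatch b, which is why carrying the hidden state over makes sense. After shuffling there is no such guarantee, so the state from the previous minibatch has no meaningful relation to the next one.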

Hey @Songlin_Zheng, thanks for your proposal! Fixed in https://github.com/d2l-ai/d2l-en/pull/1514

In the last part, before using the random sampling method, the model should be created again. Otherwise it is a pretrained model.

@pchen train_ch8 forcefully reinitializes model’s parameters, so there is no need to instantiate a new model.
What should really be created anew here, prior to the train_ch8 run with random sampling, is train_iter. Initially, it was created with the default use_random_iter=False, but the second time we want it to generate random minibatches, so use_random_iter should be True. While the results will likely remain the same, it is slightly inconsistent to train the model with random sampling on the sequentially partitioned train_iter.

I think you are right, since the perplexity of the second graph is too small at the beginning.

train_ch8 doesn’t reinitialize the parameters of net. They are only initialized when the model is created with net = RNNModelScratch(...).

I tried reinitializing net() before the training of random sampling. The training curve looked completely different.

@pchen Oh, I see… Reinitialization of RNNModelScratch does not happen because it does not inherit mxnet.gluon.Block / torch.nn.Module.

I was talking about running train_ch8(..., use_random_iter=True) on a sequentially partitioned train_iter (current implementation) vs. a randomly sampled train_iter (expected). With the default hyperparameters, I got almost identical convergence rates and final perplexity values, although some hyperparameter tuning might be required for a fairer comparison.

@pchen @sanjaradylov
You are correct, the training should start anew.
I have created a GitHub issue for it:
https://github.com/d2l-ai/d2l-en/issues/1678

I am also working on creating a TensorFlow version for the rest of chapters 8 and 9, and there are some small changes that will be made to rnn-scratch as well (mainly the training loop and the usage of get_params / params).

Can someone tell me why we need to detach the state when training the model?

Section 8.7, Backpropagation Through Time, explains this in more detail.
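
As a rough illustration of what detaching does, here is a toy PyTorch sketch. The weight W, the update rule, and the variable names are made up for this example and are not the book's training loop. The idea is that the carried-over state keeps its values but drops its link to the computation graph of the earlier minibatch, so the backward pass only covers the current minibatch.

```python
import torch

torch.manual_seed(0)
W = torch.randn(4, 4, requires_grad=True)  # hypothetical recurrent weight
state = torch.zeros(1, 4)                  # initial hidden state

# Minibatch 1: the new state now depends on W through the graph.
state = torch.tanh(state @ W + 1.0)
had_graph = state.grad_fn is not None

# Before minibatch 2: keep the values, cut the history.
carried = state.detach()

# Minibatch 2: gradients flow back only as far as `carried`.
new_state = torch.tanh(carried @ W + 1.0)
loss = new_state.sum()
loss.backward()

print(had_graph)                 # the undetached state was attached to a graph
print(carried.grad_fn is None)   # the detached copy has no history
print(W.grad.shape)
```

Without the detach call, each backward pass would have to reach back through every earlier minibatch in the sequence, which is the cost that truncated backpropagation through time is meant to avoid.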

I rewrote the predict_ch8 function so I could ask better questions about it. I’d submit a PR, but it doesn’t look like the code presented on this page is the same as on GitHub (as of 2021-11).

You’ll have to update the vocab instance calls since I wrote my own, but the docstring explains them.

def predict_ch8(list_tokens, num_preds, net, instance_vocab, device = torch.device('cpu')):
    """
    Generate new tokens that succeed the given set.
    This function relies on the following attributes from the vocab object
        dict_words = { word : index }
        dict_indices = { index : word }
    This will be rewritten to reflect torchtext to ensure a uniform API

    The author used a batch_size = 1, but I'm not sure how much 1 is hard-coded.
    
    The list of tokens should be tokenized using the same method the model was trained on.    
    """
    hidden_state = net.initialize_hidden_state(batch_size=1, device=device)
    indices_output = [instance_vocab.dict_words[list_tokens[0]]]
    construct_input = lambda indices: torch.tensor([indices[-1]], device=device).reshape((1, 1))

    for token in list_tokens[1:]:  # Warm-up period
        states_previous, hidden_state = net(construct_input(indices_output), hidden_state)
        indices_output.append(instance_vocab.dict_words[token])

    for _ in range(num_preds):  # Predict `num_preds` steps
        states_previous, hidden_state = net(construct_input(indices_output), hidden_state)
        indices_output.append(int(states_previous.argmax(dim=1).reshape(1)))

    return ''.join([instance_vocab.dict_indices[i] for i in indices_output])

list_list_tokens_prefix = [
    'time traveller'.split(),
    'traveller'.split(),
    'the time'.split(),
]
for list_tokens_prefix in list_list_tokens_prefix:
    string_prediction = predict_ch8(list_tokens_prefix, 50, net, vocab, torch.device('cpu'))
    print(string_prediction)

Questions:

  1. I see we’re predicting one item at a time. Where (for example, during the construction of the linear_layer in the RNN) do we define this?

  2. The docstring mentions the batch_size = 1, and I notice there are several reshape and dimension calls that state 1 as well. Can you explain the meaning behind these hard-coded items? What if we want to return two hidden states?

  3. What is the purpose of the warm-up period? I see it’s calling the network on the given list of tokens and returning the hidden states, but why are we doing that?

Is it common practice to detach the state? I have not seen this in other RNN examples.

In the second last line of the function train_epoch_ch8:

metric.add(l * y.numel(), y.numel())

But the first element of metric tracks the loss of the model (from which the perplexity is computed), so why are we adding the loss multiplied by the number of elements instead of just adding the loss (l)?
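
One way to look at it, assuming l is the mean loss over the tokens of the current minibatch and the epoch perplexity is computed as the exponential of metric[0] / metric[1]: multiplying by y.numel() turns the mean back into a sum, so the final division gives a per-token average over the whole epoch. A plain-Python sketch with made-up numbers (the function name is hypothetical):

```python
import math


def perplexity_from_batches(batches):
    """batches: iterable of (mean_loss, num_tokens) pairs."""
    total_loss = 0.0
    total_tokens = 0
    for mean_loss, num_tokens in batches:
        total_loss += mean_loss * num_tokens  # plays the role of l * y.numel()
        total_tokens += num_tokens            # plays the role of y.numel()
    return math.exp(total_loss / total_tokens)


# Two minibatches of different sizes (illustrative values only).
batches = [(2.0, 100), (1.0, 300)]
weighted = perplexity_from_batches(batches)
naive = math.exp(sum(loss for loss, _ in batches) / len(batches))
print(weighted, naive)
```

Here the token-weighted average loss is (2.0 * 100 + 1.0 * 300) / 400 = 1.25, while the plain average of the two batch means is 1.5. The two only agree when every minibatch has the same number of tokens, so accumulating l * y.numel() is the safer bookkeeping.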