Recurrent Neural Network Implementation from Scratch

mli · July 13, 2020, 4:30pm

https://d2l.ai/chapter_recurrent-neural-networks/rnn-scratch.html

Songlin_Zheng · November 5, 2020, 2:45pm

It seems that sequential partitioning is used in this implementation for minibatch generation, so in exercise 2.2, maybe it should be ‘replace sequential partitioning with random sampling’?

Also, in the end when we test random sampling of hidden states, is it more reasonable to set d2l.load_data_time_machine(batch_size, num_steps, use_random_iter=True)?
Because in that case adjacent minibatches would not be adjacent in the sequence, so the state should be randomly initialized for each minibatch. If we use sequential partitioning, we should keep using the previous states because the data is contiguous.
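To make the difference concrete, here is a minimal pure-Python sketch of the two schemes. The function names are hypothetical and this is not the book's data loader (it leaves out the random offset, for instance); it only illustrates which minibatches are contiguous.

```python
import random


def seq_batches_sequential(corpus, batch_size, num_steps):
    """Yield (X, Y) so that row i of a batch continues row i of the previous batch."""
    num_tokens = ((len(corpus) - 1) // batch_size) * batch_size
    row_len = num_tokens // batch_size
    # Cut the corpus into `batch_size` long rows; labels are inputs shifted by one.
    xs = [corpus[i * row_len:(i + 1) * row_len] for i in range(batch_size)]
    ys = [corpus[i * row_len + 1:(i + 1) * row_len + 1] for i in range(batch_size)]
    for start in range(0, (row_len // num_steps) * num_steps, num_steps):
        X = [row[start:start + num_steps] for row in xs]
        Y = [row[start:start + num_steps] for row in ys]
        yield X, Y


def seq_batches_random(corpus, batch_size, num_steps, seed=0):
    """Yield (X, Y) built from shuffled subsequences; batches are not contiguous."""
    rng = random.Random(seed)
    num_subseqs = (len(corpus) - 1) // num_steps
    starts = [i * num_steps for i in range(num_subseqs)]
    rng.shuffle(starts)
    for i in range(0, (num_subseqs // batch_size) * batch_size, batch_size):
        chunk = starts[i:i + batch_size]
        X = [corpus[s:s + num_steps] for s in chunk]
        Y = [corpus[s + 1:s + 1 + num_steps] for s in chunk]
        yield X, Y


if __name__ == "__main__":
    corpus = list(range(21))
    for X, Y in seq_batches_sequential(corpus, batch_size=2, num_steps=5):
        print("sequential", X)
    for X, Y in seq_batches_random(corpus, batch_size=2, num_steps=5):
        print("random    ", X)
```

With the sequential version, the first row of the second minibatch starts exactly where the first row of the first minibatch ended, which is why carrying the hidden state over makes sense there and not in the random case.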

goldpiggy · November 8, 2020, 10:19pm

Hey @Songlin_Zheng, thanks for your proposal! Fixed in https://github.com/d2l-ai/d2l-en/pull/1514

pchen · February 27, 2021, 9:43am

In the last part, before using the random sampling method, the model should be created again. Otherwise it is a pretrained model.

sanjaradylov · February 27, 2021, 10:39am

@pchen train_ch8 forcefully reinitializes model’s parameters, so there is no need to instantiate a new model.
What should really be created anew here, before running train_ch8 with random sampling, is train_iter. Initially, it was created with the default use_random_iter=False, but the second time we want it to generate random minibatches, so use_random_iter should be True. While the results will likely remain the same, it is a little inconsistent to train the model with random sampling on the sequentially partitioned train_iter.

Lefan · March 1, 2021, 7:49am

I think you are right, since the perplexity of the second graph is too small at the beginning.

pchen · March 5, 2021, 6:38am

train_ch8 doesn’t reinitialize the parameters of net(). They are only initialized when the model is created with net = RNNModelScratch(...)

I tried reinitializing net() before the training of random sampling. The training curve looked completely different.

sanjaradylov · March 5, 2021, 8:23am

@pchen Oh, I see… Reinitialization of RNNModelScratch does not happen because it does not inherit mxnet.gluon.Block / torch.nn.Module.

I was talking about running train_ch8(..., use_random_iter=True) on sequentially partitioned train_iter (current implementation) vs. randomly sampled train_iter (expected). With the default hyperparameters, I got almost identical convergence rates and final perplexity values, although some hyperparameter tuning might be required for a more objective comparison.

floriandonhauser · March 17, 2021, 12:06pm

@pchen @sanjaradylov
You are correct, the training should start from scratch.
I have created a GitHub issue for it:
https://github.com/d2l-ai/d2l-en/issues/1678

I am also working on creating a TensorFlow version for the rest of chapters 8 and 9, and some small changes will be made to rnn-scratch as well (mainly the training loop and the usage of get_params / params).

BH_L · May 23, 2021, 2:25pm

Can someone tell me why we need to detach the state when training the model?

sushmit86 · June 30, 2021, 8:12pm

Section 8.7, Backpropagation Through Time, explains this in more detail.
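As a rough sketch of the argument (using the h_t = f(x_t, h_{t-1}, w_h) notation, and assuming the state is detached once at the start of every minibatch): the gradient of the hidden state is a recursion over earlier time steps, and detaching treats the incoming state h_0 as a constant, so the recursion stops at the minibatch boundary instead of reaching back through all earlier minibatches.

```latex
\begin{align}
h_t &= f(x_t, h_{t-1}, w_h), \qquad t = 1, \ldots, T \\
\frac{\partial h_t}{\partial w_h}
  &= \frac{\partial f(x_t, h_{t-1}, w_h)}{\partial w_h}
   + \frac{\partial f(x_t, h_{t-1}, w_h)}{\partial h_{t-1}}
     \frac{\partial h_{t-1}}{\partial w_h} \\
\frac{\partial h_0}{\partial w_h} &= 0
  \quad \text{(the detached state is treated as a constant)}
\end{align}
```

Here T plays the role of num_steps, so each backward pass involves at most T steps of the recursion.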

dhern023 · November 14, 2021, 8:20pm

I rewrote the predict_ch8 function so I could ask better questions about it. I’d submit a PR, but the code presented on this page doesn’t look the same as the code on GitHub (as of 2021-11).

You’ll have to update the vocab instance calls since I wrote my own, but the docstring explains them.

def predict_ch8(list_tokens, num_preds, net, instance_vocab, device = torch.device('cpu')):
    """
    Generate new tokens that succeed the given set.
    This function relies on the following attributes from the vocab object
        dict_words = { word : index }
        dict_indices = { index : word }
    This will be rewritten to reflect torchtext to ensure a uniform API

    The author used a batch_size = 1, but I'm not sure how much 1 is hard-coded.
    
    The list of tokens should be tokenized using the same method the model was trained on.    
    """
    hidden_state = net.initialize_hidden_state(batch_size=1, device=device)
    indices_output = [instance_vocab.dict_words[list_tokens[0]]]
    construct_input = lambda indices: torch.tensor([indices[-1]], device=device).reshape((1, 1))

    for token in list_tokens[1:]:  # Warm-up period
        states_previous, hidden_state = net(construct_input(indices_output), hidden_state)
        indices_output.append(instance_vocab.dict_words[token])

    for _ in range(num_preds):  # Predict `num_preds` steps
        states_previous, hidden_state = net(construct_input(indices_output), hidden_state)
        indices_output.append(int(states_previous.argmax(dim=1).reshape(1)))

    return ''.join([instance_vocab.dict_indices[i] for i in indices_output])

list_list_tokens_prefix = [
    'time traveller'.split(),
    'traveller'.split(),
    'the time'.split(),
]
for list_tokens_prefix in list_list_tokens_prefix:
    string_prediction = predict_ch8(list_tokens_prefix, 50, net, vocab, torch.device('cpu'))
    print(string_prediction)

Questions:

I see we’re predicting one item at a time. Where do we define this (for example, in the construction of the linear_layer in the RNN)?
The docstring mentions batch_size = 1, and I notice several reshape and dimension calls that hard-code 1 as well. Can you explain the meaning behind these hard-coded values? What if we want to return two hidden states?
What is the purpose of the warm-up period? I see it calls the network on the given list of tokens and returns the hidden states, but why are we doing that?
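On the warm-up question, one way to see it is with a toy scalar "RNN". This is only a sketch: toy_rnn_step, warm_up, and the weights are made up for illustration and have nothing to do with the trained model. Feeding the prefix one token at a time (one sequence, one time step per call, which is presumably what the reshape((1, 1)) expresses if the input layout is (batch_size, num_steps)) leaves the hidden state in a condition that depends on the whole prefix, so the first real prediction is conditioned on it.

```python
import math


def toy_rnn_step(token_id, state, w_x=0.3, w_h=0.8):
    """One step of a scalar toy RNN; returns (output, new_state)."""
    new_state = math.tanh(w_x * token_id + w_h * state)
    return new_state, new_state


def warm_up(prefix_ids, state=0.0):
    """Run the prefix through the toy RNN, discarding outputs and keeping the state."""
    for token_id in prefix_ids:
        _, state = toy_rnn_step(token_id, state)
    return state


if __name__ == "__main__":
    # Same tokens, different order: the resulting hidden states differ.
    print(warm_up([1, 2, 3]))
    print(warm_up([3, 2, 1]))
```

Without the warm-up, every prediction would start from the freshly initialized state, regardless of the prefix.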

sushmit86 · March 11, 2022, 7:17pm

Is it common practice to detach the state? I have not seen this in other RNN examples.

Isaacwu0718 · May 1, 2022, 10:03am

In the second last line of the function train_epoch_ch8:

metric.add(l * y.numel(), y.numel())

But the first element of metric represents the loss (perplexity) of the model, so why are we adding the loss multiplied by the number of elements instead of just adding the loss (l)?
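A possible explanation, assuming l is the mean loss over the tokens of one minibatch: multiplying by y.numel() turns it back into a summed loss, so that dividing the two accumulated totals at the end gives the average loss per token over the whole epoch, and perplexity is the exponential of that. The sketch below uses a hypothetical helper and made-up numbers to show the bookkeeping.

```python
import math


def perplexity_from_batches(batches):
    """batches: iterable of (mean_loss_per_token, num_tokens) pairs."""
    total_loss, total_tokens = 0.0, 0
    for mean_loss, num_tokens in batches:
        total_loss += mean_loss * num_tokens  # same role as l * y.numel()
        total_tokens += num_tokens            # same role as y.numel()
    return math.exp(total_loss / total_tokens)


if __name__ == "__main__":
    # Two minibatches of different sizes (made-up values).
    batches = [(2.0, 10), (1.0, 30)]
    print(perplexity_from_batches(batches))
    # Averaging the per-batch means directly would weight both batches equally:
    print(math.exp((2.0 + 1.0) / 2))
```

When all minibatches have the same number of tokens the two approaches agree, but weighting by the token count stays correct when they do not.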

Palash_Nandi · July 14, 2023, 12:35pm

Why am I getting an error when executing “construct_input” (the lambda function “get_input” in the text)? The error I get when I execute the lambda is ‘int’ object is not subscriptable. When I run the whole program, it says “Could not infer dtype of method”. Can anybody please help me with this?
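Without seeing the code it is hard to be sure, but the first message is what Python raises when something indexes an int. The lambda does indices[-1], so it expects the whole list of indices, not a single index. A minimal reproduction (last_index is just a stand-in for the indexing part of the lambda):

```python
def last_index(indices):
    """Stand-in for the indexing done inside the lambda: take the last index."""
    return indices[-1]


outputs = [3]

ok = last_index(outputs)  # passing the list works

try:
    last_index(outputs[0])  # passing a single int fails
except TypeError as err:
    message = str(err)

if __name__ == "__main__":
    print(ok)
    print(message)
```

The second message usually means a method object was passed to the tensor constructor instead of its result, for example a call written without parentheses. That is only a guess about this particular case.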

pandalabme · September 4, 2023, 8:35am

My solutions to the exercises: 9.5

Agile_Developer · February 8, 2024, 4:40pm

I do not see a softmax applied to the output layer. Normally the outputs should be transformed into probabilities over the token indices.

The classifier, however, is trained with a softmax loss that takes integer labels.

I’m confused.
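A likely resolution, assuming the training code uses a cross-entropy loss that takes raw logits and integer labels (as PyTorch's nn.CrossEntropyLoss does): the softmax is folded into the loss, so the model itself only has to output logits. For prediction, argmax over logits picks the same index as argmax over softmax probabilities. A pure-Python sketch with made-up logits:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_from_logits(logits, label):
    """Cross-entropy for one example, computed from logits via log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


if __name__ == "__main__":
    logits = [2.0, -1.0, 0.5]
    label = 0
    probs = softmax(logits)
    # Both lines print the same loss value, up to floating-point rounding.
    print(cross_entropy_from_logits(logits, label))
    print(-math.log(probs[label]))
    # The predicted index is the same with or without softmax.
    print(logits.index(max(logits)), probs.index(max(probs)))
```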