https://d2l.ai/chapter_recurrent-neural-networks/rnn-scratch.html
It seems that sequential partitioning is used in this implementation for minibatch generation, so in exercise 2.2, maybe it should be "replace sequential partitioning with random sampling"?
Also, at the end, when we test the random sampling method, is it more reasonable to set d2l.load_data_time_machine(batch_size, num_steps, use_random_iter=True)?
That way, adjacent minibatches would not be adjacent in the original sequence, so the state for each minibatch should be initialized from scratch rather than carried over. If we use sequential partitioning, we should keep using the previous state because of the continuity of the data.
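Roughly, this corresponds to the branch in the chapter's train_epoch_ch8 that decides whether to reuse the state. A simplified sketch in my own words, not the exact d2l code (begin_state and the exact signatures may differ between versions):

def epoch_state_handling_sketch(net, train_iter, device, use_random_iter):
    # Simplified sketch of how `train_epoch_ch8` initializes or carries the
    # hidden state; illustration only, not the verbatim d2l code.
    state = None
    for X, Y in train_iter:
        if state is None or use_random_iter:
            # Random sampling: adjacent minibatches are not contiguous in the
            # original sequence, so start from a freshly initialized state.
            state = net.begin_state(batch_size=X.shape[0], device=device)
        else:
            # Sequential partitioning: reuse the previous state, but detach it
            # so backpropagation stops at the minibatch boundary.
            if isinstance(state, tuple):
                state = tuple(s.detach() for s in state)
            else:
                state = state.detach()
        y_hat, state = net(X.to(device), state)
        # ... compute the loss, backpropagate, clip gradients, update ...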
In the last part, before using the random sampling method, the model should be created again; otherwise we are continuing from an already pretrained model.
@pchen train_ch8 forcefully reinitializes the model's parameters, so there is no need to instantiate a new model. What should really be created anew here, prior to the train_ch8 run with random sampling, is train_iter. Initially it was created with the default use_random_iter=False, but the second time we want it to generate random minibatches, so use_random_iter should be True. While the results will likely remain the same, it is a little inconsistent to train the model with random sampling on the sequentially partitioned train_iter.
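Concretely, something like this is what I mean (net and the hyperparameter values are the ones already defined in the chapter; exact call signatures may differ between d2l versions):

from d2l import torch as d2l

# Regenerate the iterator with random sampling before the second training run.
# Hyperparameters follow the chapter; signatures may differ by d2l version.
batch_size, num_steps = 32, 35
train_iter_random, vocab = d2l.load_data_time_machine(
    batch_size, num_steps, use_random_iter=True)
d2l.train_ch8(net, train_iter_random, vocab, lr=1, num_epochs=500,
              device=d2l.try_gpu(), use_random_iter=True)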
I think you are right, since the perplexity in the second graph is already too small at the very beginning.
train_ch8 doesn't reinitialize the parameters of net. They are only initialized when the model is created: net = RNNModelScratch(...).
I tried reinitializing net before training with random sampling. The training curve looked completely different.
@pchen Oh, I see... Reinitialization of RNNModelScratch does not happen because it does not inherit from mxnet.gluon.Block / torch.nn.Module.
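So to actually restart from fresh parameters, the scratch model has to be instantiated again before the second run, e.g. (the constructor arguments follow the chapter's PyTorch version; just a sketch):

# Re-create the scratch model so that its parameters are freshly initialized;
# vocab, num_hiddens, get_params, init_rnn_state, rnn are the chapter's objects.
net = RNNModelScratch(len(vocab), num_hiddens, d2l.try_gpu(),
                      get_params, init_rnn_state, rnn)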
I was talking about running train_ch8(..., use_random_iter=True) on the sequentially partitioned train_iter (current implementation) vs. the randomly sampled train_iter (expected). With the default hyperparameters, I got almost identical convergence rates and final perplexity values, although some hyperparameter tuning might be required for more objectivity.
@pchen @sanjaradylov
You are correct, the training should start anew.
I have created a GitHub issue for it:
https://github.com/d2l-ai/d2l-en/issues/1678
I am also working on creating a TensorFlow version of the rest of chapters 8 and 9, and some small changes will be made to rnn-scratch as well (mainly the training loop and the usage of get_params / params).
Can someone tell me why we need to detach the state when training the model?
Section 8.7, Backpropagation Through Time, explains this in more detail.
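If it helps, here is a tiny self-contained illustration (a toy example of my own, not the book's code) of what detaching does to the gradient graph:

import torch

# Toy example: detach() cuts the computation graph, so gradients no longer
# flow through the earlier computation that produced the state.
w = torch.tensor(2.0, requires_grad=True)
state = w * 3            # depends on w
state = state.detach()   # from now on, treated as a constant
loss = state * w         # only this direct use of w is differentiated
loss.backward()
print(w.grad)            # tensor(6.); without the detach it would be 12,
                         # because the path through `state` would also count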
I rewrote the predict_ch8 function so I could ask better questions about it. I'd submit a PR, but it doesn't look like the code presented on this page is the same as on GitHub (as of 2021-11).
You'll have to update the vocab instance calls since I wrote my own, but the docstring explains them.
import torch

def predict_ch8(list_tokens, num_preds, net, instance_vocab, device=torch.device('cpu')):
    """Generate new tokens that succeed the given set.

    This function relies on the following attributes of the vocab object:
        dict_words   = {word: index}
        dict_indices = {index: word}
    This will be rewritten to reflect torchtext to ensure a uniform API.
    The author used batch_size = 1, but I'm not sure how much 1 is hard-coded.
    The list of tokens should be tokenized with the same method the model was
    trained on.
    """
    hidden_state = net.initialize_hidden_state(batch_size=1, device=device)
    indices_output = [instance_vocab.dict_words[list_tokens[0]]]
    # Wrap the most recent index into a (1, 1) tensor: (batch_size, num_steps).
    construct_input = lambda indices: torch.tensor(
        [indices[-1]], device=device).reshape((1, 1))
    for token in list_tokens[1:]:  # Warm-up period
        states_previous, hidden_state = net(construct_input(indices_output), hidden_state)
        indices_output.append(instance_vocab.dict_words[token])
    for _ in range(num_preds):  # Predict `num_preds` steps
        states_previous, hidden_state = net(construct_input(indices_output), hidden_state)
        indices_output.append(int(states_previous.argmax(dim=1).reshape(1)))
    return ''.join([instance_vocab.dict_indices[i] for i in indices_output])
list_list_tokens_prefix = [
    'time traveller'.split(),
    'traveller'.split(),
    'the time'.split(),
]
for list_tokens_prefix in list_list_tokens_prefix:
    string_prediction = predict_ch8(list_tokens_prefix, 50, net, vocab, torch.device('cpu'))
    print(string_prediction)
Questions:
- I see we're predicting one item out. Where (e.g., during the construction of the linear layer in the RNN) do we define this?
- The docstring mentions batch_size = 1, and I notice several reshape and dimension calls that hard-code 1 as well. Can you explain the meaning behind these hard-coded values? What if we want to return two hidden states?
- What is the purpose of the warm-up period? I see it's calling the network on the given list of tokens and returning the hidden states, but why are we doing that?
Is it common practice to detach the state? In other examples of RNNs I have not seen this.
In the second-to-last line of the function train_epoch_ch8:
metric.add(l * y.numel(), y.numel())
The first element of metric is supposed to track the loss (perplexity) of the model, so why are we adding the loss multiplied by the number of elements instead of just adding the loss l?
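My reading of that line (assuming l is the mean per-token loss and the epoch ends with something like math.exp(metric[0] / metric[1]); please correct me if your version differs): the two accumulator slots hold the total loss and the total token count, so their ratio is a token-weighted mean. A toy illustration:

import math

# If `l` is the *mean* loss per token in a minibatch, multiplying by the token
# count accumulates the *total* loss. Dividing the two accumulator slots then
# gives a token-weighted mean loss, whose exp is the perplexity. Two unequal
# minibatches make the difference visible:
batch_losses = [(2.0, 350), (1.0, 70)]               # (mean loss, token count)

total_loss = sum(l * n for l, n in batch_losses)     # plays the role of metric[0]
total_tokens = sum(n for _, n in batch_losses)       # plays the role of metric[1]

weighted_mean = total_loss / total_tokens            # ~1.833, weights the big batch more
naive_mean = sum(l for l, _ in batch_losses) / 2     # 1.5, ignores batch sizes
print(math.exp(weighted_mean), math.exp(naive_mean)) # ~6.26 vs ~4.48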
But why am I getting an error while executing 'construct_input' (or the lambda function 'get_input' as in the text)? I get "'int' object is not subscriptable" when I try to execute the lambda, and when I run the whole program it says "Could not infer dtype of method". Can anybody please help me with this...
I do not see the output layer being passed through a softmax. Normally the output should be transformed into probabilistic predictions over the indices.
The classifier uses a loss that involves a softmax with integer labels.
I'm confused.
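If it helps, my understanding is that PyTorch's cross-entropy loss takes raw logits and integer labels and applies log-softmax internally, so an explicit softmax in the output layer is unnecessary for training. A quick check (toy example, not the book's code):

import torch
import torch.nn.functional as F

# F.cross_entropy (and nn.CrossEntropyLoss) take raw logits plus integer
# labels and apply log-softmax internally, so the model's output layer can
# stay linear. An explicit softmax is only needed when you actually want
# probabilities, e.g. at prediction time.
logits = torch.randn(4, 10)            # (batch, vocab_size), raw scores
labels = torch.randint(0, 10, (4,))    # integer class indices

loss_a = F.cross_entropy(logits, labels)
loss_b = F.nll_loss(F.log_softmax(logits, dim=1), labels)
print(torch.allclose(loss_a, loss_b))  # True: the two are equivalent

probs = F.softmax(logits, dim=1)       # explicit softmax for probabilities
print(probs.sum(dim=1))                # each row sums to 1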