It seems that sequential partitioning is used in this implementation for minibatch generation, so in exercise 2.2, perhaps it should read 'replace sequential partitioning with random sampling'?
Also, at the end, when we test random sampling of hidden states, wouldn't it be more reasonable to call d2l.load_data_time_machine(batch_size, num_steps, use_random_iter=True)?
That way, adjacent minibatches are not adjacent in the sequence, so the state from the previous minibatch should be randomly initialized. If we use sequential partitioning, we should keep carrying over the previous state because of the continuity of the data.
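To make the distinction concrete, here is a minimal, self-contained sketch of random sampling over a token corpus. It loosely follows the idea behind d2l's seq_data_iter_random, but the function name and details here are my own simplified version, not the book's actual code:

```python
import random

def seq_iter_random(corpus, batch_size, num_steps, seed=None):
    """Sketch of random sampling: subsequences are drawn in shuffled order,
    so adjacent minibatches are not adjacent in the original sequence and
    the hidden state must be re-initialized for every minibatch."""
    rng = random.Random(seed)
    # Random offset so different epochs see different subsequence boundaries.
    offset = rng.randint(0, num_steps - 1)
    corpus = corpus[offset:]
    # Leave one token at the end so the labels Y (inputs shifted by one) fit.
    num_subseqs = (len(corpus) - 1) // num_steps
    starts = [i * num_steps for i in range(num_subseqs)]
    rng.shuffle(starts)  # this shuffle is what breaks adjacency
    for i in range(0, num_subseqs - num_subseqs % batch_size, batch_size):
        batch_starts = starts[i:i + batch_size]
        X = [corpus[s:s + num_steps] for s in batch_starts]
        Y = [corpus[s + 1:s + 1 + num_steps] for s in batch_starts]
        yield X, Y
```

Because the start positions are shuffled, carrying a hidden state from one minibatch into the next would pair it with unrelated text, which is why the state is reset under random sampling.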
In the last part, before using the random sampling method, the model should be created again. Otherwise it is a pretrained model.
train_ch8 forcefully reinitializes the model's parameters, so there is no need to instantiate a new model.
What should really be created anew here, prior to the train_ch8 run with random sampling, is train_iter. Initially, it was created with the default use_random_iter=False, but the second time we want it to generate random minibatches, so use_random_iter should be True. While the results will likely remain the same, it is a small inconsistency to train the model with random sampling on the sequentially partitioned train_iter.
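For contrast with random sampling, here is a minimal, self-contained sketch of sequential partitioning, loosely in the spirit of d2l's seq_data_iter_sequential (again, the function name and details are my own simplified version):

```python
import random

def seq_iter_sequential(corpus, batch_size, num_steps, seed=None):
    """Sketch of sequential partitioning: minibatch i+1 continues exactly
    where minibatch i left off in each batch row, which is why the hidden
    state can be carried over between minibatches."""
    rng = random.Random(seed)
    offset = rng.randint(0, num_steps - 1)
    num_tokens = ((len(corpus) - offset - 1) // batch_size) * batch_size
    Xs = corpus[offset:offset + num_tokens]
    Ys = corpus[offset + 1:offset + 1 + num_tokens]
    # Split the corpus into batch_size parallel contiguous streams.
    stream_len = num_tokens // batch_size
    X_streams = [Xs[i * stream_len:(i + 1) * stream_len]
                 for i in range(batch_size)]
    Y_streams = [Ys[i * stream_len:(i + 1) * stream_len]
                 for i in range(batch_size)]
    for t in range(0, stream_len - stream_len % num_steps, num_steps):
        X = [s[t:t + num_steps] for s in X_streams]
        Y = [s[t:t + num_steps] for s in Y_streams]
        yield X, Y
```

Here each row of a minibatch picks up exactly where the same row of the previous minibatch ended, so training with use_random_iter=True (which resets the state every minibatch) on such an iterator throws away valid continuity, hence the inconsistency.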
I think you are right, since the perplexity of the second graph is too small at the beginning.
train_ch8 doesn't reinitialize the parameters of net. They are only initialized once, when the model is created with net = RNNModelScratch(...).
I tried reinitializing net before training with random sampling. The training curve looked completely different.
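The point about reinitialization can be shown with a toy stand-in class (my own illustration, not d2l's RNNModelScratch): parameters are drawn once in __init__, and nothing in a plain training loop resets them unless the model object is re-created.

```python
import random

class TinyScratchModel:
    """Toy stand-in for a from-scratch model: parameters are initialized
    once at construction time and never reset afterwards."""
    def __init__(self, num_params, seed=None):
        rng = random.Random(seed)
        self.params = [rng.gauss(0.0, 0.01) for _ in range(num_params)]

net = TinyScratchModel(3, seed=0)
fresh = list(net.params)
# Pretend a round of training moved the parameters:
net.params = [p + 1.0 for p in net.params]
trained = list(net.params)
# Re-instantiating is what yields fresh, untrained parameters again:
net = TinyScratchModel(3, seed=0)
```

So reusing the same net object for the second train_ch8 run really does start from the already-trained parameters, which explains the suspiciously low initial perplexity.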
@pchen Oh, I see… Reinitialization of RNNModelScratch does not happen because it does not inherit from nn.Module.
I was talking about running train_ch8(..., use_random_iter=True) on a sequentially partitioned train_iter (the current implementation) vs. a randomly sampled train_iter (what is expected). With the default hyperparameters, I got almost identical convergence rates and final perplexity values, though some hyperparameter tuning might be needed for a more objective comparison.
I am also working on a TensorFlow version of the rest of chapters 8 and 9, and there are some small changes that will be made to rnn-scratch as well (mainly the training loop and the usage of get_params / params).
Can someone tell me why we need to detach the state when training the model?
Section 8.7, Backpropagation Through Time, explains this in more detail.
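The short version: detaching cuts the computational graph between minibatches, so gradients stop flowing back into earlier minibatches (truncated backpropagation through time), which keeps backward passes cheap and avoids ever-growing graphs. A toy illustration of the effect (just scalar tensors, not the book's RNN):

```python
import torch

# One learnable "weight" reused across two fake time steps.
w = torch.tensor(2.0, requires_grad=True)
state = torch.tensor(1.0)

# Without detaching, the gradient flows through both steps:
s1 = state * w
s2 = s1 * w            # s2 = w ** 2
s2.backward()
full_grad = w.grad.item()        # d(w^2)/dw = 2 * w = 4.0

w.grad = None
# With detaching, s1 is treated as a constant in the second step:
s1 = (state * w).detach()
s2 = s1 * w            # s2 = 2 * w, with s1 fixed at 2
s2.backward()
truncated_grad = w.grad.item()   # just s1 = 2.0
```

The detached gradient only reflects the last step, which is exactly what carrying a detached state between sequential minibatches does at a larger scale.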