Implementation of Recurrent Neural Networks from Scratch

It seems that sequential partitioning is used in this implementation for minibatch generation, so in exercise 2.2, maybe it should be ‘replace sequential partitioning with random sampling’?

Also, at the end, when we test random sampling with reinitialized hidden states, wouldn't it be more reasonable to use d2l.load_data_time_machine(batch_size, num_steps, use_random_iter=True)?
That way, consecutive minibatches would not be adjacent in the sequence, so the state should be reinitialized for each minibatch rather than carried over from the previous one. If we use sequential partitioning, we should keep using the previous states because the data is continuous.
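To illustrate the point about continuity, here is a minimal sketch in plain Python. It is not the d2l implementation: the helper names (seq_partition_starts, random_sampling_starts, rows_continue) are made up for this example, and the random starting offset that the book's iterators use is left out for simplicity. It only tracks where each subsequence starts in the token stream.

```python
import random

def seq_partition_starts(num_tokens, batch_size, num_steps):
    """Start offsets of every row in every minibatch (sequential partitioning)."""
    per_row = (num_tokens - 1) // batch_size   # tokens assigned to each row
    num_batches = per_row // num_steps
    return [[row * per_row + b * num_steps for row in range(batch_size)]
            for b in range(num_batches)]

def random_sampling_starts(num_tokens, batch_size, num_steps, seed=0):
    """Start offsets when subsequences are shuffled before batching."""
    rng = random.Random(seed)
    num_subseqs = (num_tokens - 1) // num_steps
    starts = [i * num_steps for i in range(num_subseqs)]
    rng.shuffle(starts)
    num_batches = num_subseqs // batch_size
    return [starts[b * batch_size:(b + 1) * batch_size]
            for b in range(num_batches)]

def rows_continue(batches, num_steps):
    """True if each row starts exactly where the same row of the previous minibatch ended."""
    return all(cur[r] == prev[r] + num_steps
               for prev, cur in zip(batches, batches[1:])
               for r in range(len(cur)))

seq = seq_partition_starts(num_tokens=35, batch_size=2, num_steps=5)
rnd = random_sampling_starts(num_tokens=35, batch_size=2, num_steps=5)
print(seq, rows_continue(seq, 5))   # rows line up, so carrying the state over makes sense
print(rnd, rows_continue(rnd, 5))   # shuffled starts generally do not line up
```

With sequential partitioning, row i of one minibatch picks up where row i of the previous minibatch stopped, which is the only case where reusing the hidden state is meaningful.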

Hey @Songlin_Zheng, thanks for your proposal! Fixed in

In the last part, before using the random sampling method, the model should be created again. Otherwise it is a pretrained model.

@pchen train_ch8 forcefully reinitializes the model's parameters, so there is no need to instantiate a new model.
What should really be created anew here, before running train_ch8 with random sampling, is train_iter. Initially, it was created with the default use_random_iter=False, but the second time we want it to generate random minibatches, so use_random_iter should be True. While the results will likely remain the same, it is slightly inconsistent to train the model with random sampling on a sequentially partitioned train_iter.

I think you are right, since the perplexity of the second graph is too small at the beginning.

train_ch8 doesn’t reinitialize the parameters of net. They are only initialized when the model is created with net = RNNModelScratch(...).

I tried reinitializing net() before the training of random sampling. The training curve looked completely different.

@pchen Oh, I see… Reinitialization of RNNModelScratch does not happen because it does not inherit mxnet.gluon.Block / torch.nn.Module.
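A toy sketch of the mechanism described above, in plain Python. Everything here is a hypothetical stand-in (FrameworkBlock, ConciseModel, ScratchModel, and train are not real d2l or framework APIs); the only assumption is that the training function reinitializes a model only when it is an instance of the framework's base class.

```python
import random

class FrameworkBlock:
    """Hypothetical stand-in for a framework base class (not a real API)."""
    def __init__(self):
        self.params = [0.0, 0.0]

    def initialize(self, rng):
        # Draw fresh small random parameters
        self.params = [rng.gauss(0, 0.01) for _ in self.params]

class ConciseModel(FrameworkBlock):
    """Model that inherits from the framework base class."""

class ScratchModel:
    """Plain Python class that creates and owns its parameters itself."""
    def __init__(self, rng):
        self.params = [rng.gauss(0, 0.01) for _ in range(2)]

def train(net, rng):
    """Toy training function: reinitialize framework models only, then 'train'."""
    if isinstance(net, FrameworkBlock):
        net.initialize(rng)
    # Pretend training: nudge every parameter by a fixed amount
    net.params = [p + 1.0 for p in net.params]
    return net.params

rng = random.Random(0)
scratch, concise = ScratchModel(rng), ConciseModel()
for _ in range(2):
    train(scratch, rng)
    train(concise, rng)
print(scratch.params)  # both runs accumulate: the second run starts from a trained model
print(concise.params)  # only the last run is visible: parameters were reset before it
```

Under that assumption, the from-scratch model has to be created again by hand before the second training run, which matches what was observed in this thread.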

I was talking about running train_ch8(..., use_random_iter=True) on a sequentially partitioned train_iter (the current implementation) vs. a randomly sampled train_iter (expected). With the default hyperparameters, I got almost identical convergence rates and final perplexity values. That said, some hyperparameter tuning might be needed for a fairer comparison.

@pchen @sanjaradylov
You are correct, the training should start anew.
I have created a GitHub issue for it:

I am also working on a TensorFlow version of the rest of chapters 8 and 9, and there are some small changes that will be made to rnn-scratch as well (mainly the training loop and the usage of get_params / params).

Can someone tell me why we need to detach the state when training the model?

Section 8.7, Backpropagation Through Time, explains this in more detail.
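As a rough intuition, here is a framework-free sketch rather than the book's code. It uses a made-up scalar recurrence h_t = w * h_{t-1} + x_t and tracks the derivative dh_t/dw by hand; run_rnn and its arguments are hypothetical names. "Detaching" is modeled as keeping the value of the state at a minibatch boundary while dropping its gradient history.

```python
def run_rnn(xs, w, num_steps, detach):
    """Return the final state and d(final state)/dw for h_t = w * h_{t-1} + x_t."""
    h, dh_dw = 0.0, 0.0
    for t, x in enumerate(xs):
        if detach and t % num_steps == 0:
            # Minibatch boundary: keep the value of h, but treat it as a constant
            dh_dw = 0.0
        dh_dw = h + w * dh_dw   # product rule applied to w * h_{t-1}
        h = w * h + x
    return h, dh_dw

xs = [1.0] * 6
h_full, g_full = run_rnn(xs, w=0.5, num_steps=3, detach=False)
h_trunc, g_trunc = run_rnn(xs, w=0.5, num_steps=3, detach=True)
print(h_full, h_trunc)   # the forward values are identical
print(g_full, g_trunc)   # the gradients differ: the detached one ignores earlier minibatches
```

The forward pass is unchanged, but with detaching the gradient chain never grows beyond num_steps steps. Without it, every minibatch would have to backpropagate through all earlier minibatches, which gets more expensive as training goes on and makes vanishing or exploding gradients more likely.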