Sequence to Sequence

in the training function above, why do we need to sum the loss before we do the backward?

l.sum().backward() # Make the loss scalar for backward`

Also, why do we do the clipping?

d2l.grad_clipping(model, 1)

another observation, it wasn’t clear to me why we effectively throw out the output of the encoder, we just use the state to initialize the decoder. What’s the intuition for this?

Hi @swg104, the backward() is applied on the final loss, see more details in the backpropagation section.

Section 8.5 talked about why. :wink:

why do we need to eval() when we test the s2sencoder or s2sdecoder?
but at predict stage there is no such opearation.

PyTorch has two modes, eval and train. eval mode should be used during inference. eval mode will disable dropout (and do the appropriate scaling of the weights), also it will make batch norm work on the averages computed during training. Btw predict_seq2seq has net.eval, you can check the preview version. This fix should be reflected in the next release.

In predict_seq2seq()

for _ in range(num_steps):
Y, dec_state = net.decoder(dec_X, dec_state)

Here dec_state is recursively returned from and used by the net.decoder.
I feel this doesn’t match the Fig. 9.7.3 where all the dec_X is concatenate with the last encoder state.
In other word, the dec_state should alway be kept as the same as the code below does.
Y, _ = net.decoder(dec_X, dec_state)

But, the new code makes a problem that in net.decoder, dec_state is also used to init the state for next timestep. Therefore, in current framework, maybe the original code could be the best solution. Or we could need to adjust the code of class Seq2SeqDecoder(d2l.Decoder)?

Please help to point out if I am right.


@shaoming_xu Hi! I think you are right – the code looks buggy. One way to fix this is to pass two state variables, enc_state as additional context and dec_state as the decoder’s hidden state, into the decoder during forward propagation, and return only dec_state along with output as enc_state is only copied but not processed:

class Seq2SeqDecoder(Decoder):
    def forward(self, X, enc_state, dec_state):
        X = self.embedding(X).permute(1, 0, 2)
        # `enc_state` as additional context
        context = enc_state[-1].repeat(X.shape[0], 1, 1)
        X_and_context =, context), 2)
        # `dec_state` as the hidden state of the decoder
        output, dec_state = self.rnn(X_and_context, dec_state)
        output = self.dense(output).permute(1, 0, 2)
        return output, dec_state
class EncoderDecoder(nn.Module):
    def forward(self, enc_X, dec_X, *args):
        enc_outputs = self.encoder(enc_X, *args)
        dec_state = self.decoder.init_state(enc_outputs, *args)
        return self.decoder(dec_X, dec_state, dec_state)
def predict_seq2seq(net, src_sentence, src_vocab, tgt_vocab, num_steps,
                    device, save_attention_weights=False):
    dec_state = net.decoder.init_state(enc_outputs, enc_valid_len)
    enc_state = dec_state.clone().detach()
    for _ in range(num_steps):
        Y, dec_state = net.decoder(dec_X, enc_state, dec_state)

Did you and @shaoming_xu notice the Ecoder’s state is used and then returned by the Decoder’s init_state method?

9.6.3. Putting the Encoder and Decoder Together

class EncoderDecoder(nn.Module):
    """The base class for the encoder-decoder architecture."""
    def __init__(self, encoder, decoder, **kwargs):
        super(EncoderDecoder, self).__init__(**kwargs)
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, enc_X, dec_X, *args):
        enc_outputs = self.encoder(enc_X, *args)
        dec_state = self.decoder.init_state(enc_outputs, *args)
        return self.decoder(dec_X, dec_state)

9.7.2. Decoder

class Seq2SeqDecoder(d2l.Decoder):
    """The RNN decoder for sequence to sequence learning."""
    def __init__(self, vocab_size, embed_size, num_hiddens, num_layers,
                 dropout=0, **kwargs):

    def init_state(self, enc_outputs, *args):
        return enc_outputs[1]

Anyone have ideas on Question 6? (Other output-layer design ideas).

I am not understand why there is no opt.zero_grad() after the opt.step() call? My understanding is the grad will accumulate if no zero_grad() called.

why average the num_steps, rather than valid_len of the num_steps???

i think so too!
i think the code should be

sum all the vaild code loss.

do you agree?

This is my idea.

thanks reply!

thank you for your reply!

in my opion:

l.sum() is all the loss of batch_size*vaild_len. num_tokens is all the tokens of the batch_size too. so i think wo don’t need to /batch_size. this is my opinion. thank you!

thank you, and I agree with you.