Sequence to Sequence Learning

astonzhang · September 17, 2020, 4:50am

https://d2l.ai/chapter_recurrent-modern/seq2seq.html

swg104 · November 1, 2020, 10:28pm

In the training function above, why do we need to sum the loss before calling backward()?

l.sum().backward() # Make the loss scalar for backward

Also, why do we do the clipping?

d2l.grad_clipping(model, 1)

swg104 · November 1, 2020, 10:30pm

Another observation: it wasn’t clear to me why we effectively throw away the encoder’s output and only use its state to initialize the decoder. What’s the intuition for this?

goldpiggy · November 2, 2020, 11:34pm

Hi @swg104, backward() has to be applied to a scalar, so the per-example losses are summed into the final loss first; see the backpropagation section for more details.

Section 8.5 explains why we clip gradients.
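
For readers who want the clipping idea in isolation, here is a minimal pure-Python sketch of clipping by the global gradient norm. It is an illustration under assumptions, not the source of d2l.grad_clipping: the function name clip_by_global_norm and the plain-list gradient layout are made up for this example.

```python
import math

def clip_by_global_norm(grads, theta):
    """Rescale a list of gradient vectors so their joint L2 norm is at most theta.

    Hypothetical helper for illustration; gradients are plain lists of floats.
    """
    norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if norm > theta:
        scale = theta / norm
        # Every component shrinks by the same factor, so the direction is kept.
        return [[g * scale for g in vec] for vec in grads]
    # Small gradients pass through unchanged.
    return [list(vec) for vec in grads]

grads = [[3.0, 4.0], [12.0]]          # joint norm is 13
clipped = clip_by_global_norm(grads, 1.0)
```

Because all components are scaled by the same factor, clipping bounds the step size without changing the update direction.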

wusq121 · January 3, 2021, 1:42pm

Why do we need to call eval() when we test the Seq2SeqEncoder or Seq2SeqDecoder?
At the prediction stage there is no such operation.

anirudh · January 3, 2021, 3:55pm

PyTorch has two modes, eval and train. eval mode should be used during inference. It disables dropout (PyTorch already rescales the kept activations during training, so no rescaling is needed at inference) and makes batch norm use the running averages computed during training. By the way, predict_seq2seq does call net.eval(); you can check the preview version. This fix should be reflected in the next release.
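
To make the train/eval difference concrete, here is a small pure-Python sketch of inverted dropout. It only mimics the behavior described above; the function dropout and its training flag are assumptions for this example, not PyTorch's implementation.

```python
import random

def dropout(xs, p, training, rng=random):
    """Inverted dropout on a list of floats (illustrative sketch).

    While training, each element is zeroed with probability p and the
    survivors are scaled by 1 / (1 - p). Otherwise the input is returned as is.
    """
    if not training or p == 0.0:
        return list(xs)
    if p >= 1.0:
        return [0.0 for _ in xs]
    keep = 1.0 - p
    return [x / keep if rng.random() < keep else 0.0 for x in xs]

rng = random.Random(0)                 # fixed seed so runs are repeatable
xs = [1.0] * 8
train_out = dropout(xs, 0.5, training=True, rng=rng)   # entries are 0.0 or 2.0
eval_out = dropout(xs, 0.5, training=False, rng=rng)   # identical to xs
```

Forgetting to switch to eval mode therefore makes predictions noisy, since units keep being dropped at random.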

shaoming_xu · January 30, 2021, 3:55am

In predict_seq2seq()

for _ in range(num_steps):
Y, dec_state = net.decoder(dec_X, dec_state)

Here dec_state is returned by net.decoder and fed back into it on every step.
I feel this doesn’t match Fig. 9.7.3, where every dec_X is concatenated with the last encoder state.
In other words, dec_state should always stay the same, as in the code below:
Y, _ = net.decoder(dec_X, dec_state)

But the new code creates a problem: inside net.decoder, dec_state is also used to initialize the state for the next time step. Therefore, in the current framework, the original code may be the best solution. Or do we need to adjust the code of class Seq2SeqDecoder(d2l.Decoder)?

Please help to point out if I am right.

Thanks!

sanjaradylov · January 30, 2021, 9:43pm

@shaoming_xu Hi! I think you are right – the code looks buggy. One way to fix this is to pass two state variables, enc_state as additional context and dec_state as the decoder’s hidden state, into the decoder during forward propagation, and return only dec_state along with output as enc_state is only copied but not processed:

class Seq2SeqDecoder(Decoder):
    ...
    def forward(self, X, enc_state, dec_state):
        X = self.embedding(X).permute(1, 0, 2)
        # `enc_state` as additional context
        context = enc_state[-1].repeat(X.shape[0], 1, 1)
        X_and_context = torch.cat((X, context), 2)
        # `dec_state` as the hidden state of the decoder
        output, dec_state = self.rnn(X_and_context, dec_state)
        output = self.dense(output).permute(1, 0, 2)
        return output, dec_state

class EncoderDecoder(nn.Module):
    ...
    def forward(self, enc_X, dec_X, *args):
        enc_outputs = self.encoder(enc_X, *args)
        dec_state = self.decoder.init_state(enc_outputs, *args)
        return self.decoder(dec_X, dec_state, dec_state)

def predict_seq2seq(net, src_sentence, src_vocab, tgt_vocab, num_steps,
                    device, save_attention_weights=False):
    ...
    dec_state = net.decoder.init_state(enc_outputs, enc_valid_len)
    enc_state = dec_state.clone().detach()
    ...
    for _ in range(num_steps):
        Y, dec_state = net.decoder(dec_X, enc_state, dec_state)
    ...

six · March 16, 2021, 8:41pm

Did you and @shaoming_xu notice that the encoder’s state is used and then returned by the decoder’s init_state method?

9.6.3. Putting the Encoder and Decoder Together

class EncoderDecoder(nn.Module):
    """The base class for the encoder-decoder architecture."""
    def __init__(self, encoder, decoder, **kwargs):
        super(EncoderDecoder, self).__init__(**kwargs)
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, enc_X, dec_X, *args):
        enc_outputs = self.encoder(enc_X, *args)
        dec_state = self.decoder.init_state(enc_outputs, *args)
        return self.decoder(dec_X, dec_state)

9.7.2. Decoder

class Seq2SeqDecoder(d2l.Decoder):
    """The RNN decoder for sequence to sequence learning."""
    def __init__(self, vocab_size, embed_size, num_hiddens, num_layers,
                 dropout=0, **kwargs):
        ...

    def init_state(self, enc_outputs, *args):
        return enc_outputs[1]
    ...

six · March 16, 2021, 8:42pm

Anyone have ideas on Question 6? (Other output-layer design ideas).

LongWU · March 19, 2021, 4:55pm

Hi,
I don’t understand why there is no opt.zero_grad() after the opt.step() call. My understanding is that the gradients will accumulate if zero_grad() is not called.

HeartSea15 · April 14, 2021, 6:01am

Why is the loss averaged over num_steps rather than over the valid_len tokens within num_steps?

min_xu · April 14, 2021, 10:21am

I think so too!
I think the code should

sum the loss over all valid tokens.

Do you agree?

HeartSea15 · April 14, 2021, 11:00am

This is my idea.

Thanks for the reply!

min_xu · April 15, 2021, 1:20am

Thank you for your reply!

In my opinion:

l.sum() is the total loss over batch_size * valid_len tokens, and num_tokens is also the total number of tokens in the batch. So I think we don’t need to divide by batch_size. Thank you!
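
The normalization choices discussed above can be compared on a tiny example. This is a pure-Python sketch with made-up numbers; masked_losses is a hypothetical helper, not the book's MaskedSoftmaxCELoss.

```python
def masked_losses(per_token_loss, valid_len):
    """Zero out the loss at padded positions (index >= valid length)."""
    return [[l if t < n else 0.0 for t, l in enumerate(row)]
            for row, n in zip(per_token_loss, valid_len)]

# Two sequences padded to num_steps = 4; the first has only 2 valid tokens.
per_token_loss = [[1.0, 1.0, 1.0, 1.0],
                  [2.0, 2.0, 2.0, 2.0]]
valid_len = [2, 4]
masked = masked_losses(per_token_loss, valid_len)
num_steps = len(per_token_loss[0])

# Mean over all num_steps positions: padding dilutes the short sequence.
mean_over_steps = [sum(row) / num_steps for row in masked]
# Mean over valid tokens only: each sequence is averaged over its own length.
mean_over_valid = [sum(row) / n for row, n in zip(masked, valid_len)]
# One number for the batch: total masked loss divided by total valid tokens.
per_token = sum(map(sum, masked)) / sum(valid_len)
```

With these numbers the first sequence gets 0.5 under the num_steps mean but 1.0 under the valid-token mean, so short sequences contribute smaller gradients when the mean runs over padded positions.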

Bay_Leaf · October 5, 2021, 9:09pm

enc_outputs = net.encoder(enc_X, enc_valid_len)
How is enc_valid_len used in the encoder’s forward method?
I don’t see it being used in the forward method of the Seq2SeqEncoder class:

def forward(self, X, *args):
    # The output `X` shape: (`batch_size`, `num_steps`, `embed_size`)
    X = self.embedding(X)
    # In RNN models, the first axis corresponds to time steps
    X = X.permute(1, 0, 2)
    # When state is not mentioned, it defaults to zeros
    output, state = self.rnn(X)
    # `output` shape: (`num_steps`, `batch_size`, `num_hiddens`)
    # `state` shape: (`num_layers`, `batch_size`, `num_hiddens`)
    return output, state

captainst · October 22, 2021, 7:46am

Me neither. What’s more, enc_valid_len is not used in the function predict_seq2seq either.

captainst · October 23, 2021, 2:28pm

I think I figured out this mysterious enc_valid_len. In Chapter 10, Attention Mechanisms, the seq2seq functions are reused, and there enc_valid_len is used for masking.
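
The mechanism that lets an unused argument pass through silently is the *args parameter. Below is a minimal sketch with two hypothetical decoder classes (PlainDecoder and MaskingDecoder are invented names, not classes from the book) sharing one call signature.

```python
class PlainDecoder:
    def init_state(self, enc_outputs, *args):
        # Extra positional arguments are accepted but never read.
        return enc_outputs[1]

class MaskingDecoder:
    def init_state(self, enc_outputs, enc_valid_len, *args):
        # A hypothetical variant that keeps the valid lengths for later masking.
        outputs, state = enc_outputs
        return outputs, state, enc_valid_len

def init_decoder(decoder, enc_outputs, enc_valid_len):
    # The caller uses the same call for both decoders.
    return decoder.init_state(enc_outputs, enc_valid_len)

enc_outputs = (["h1", "h2", "h3"], "final_state")
plain = init_decoder(PlainDecoder(), enc_outputs, [2])
masking = init_decoder(MaskingDecoder(), enc_outputs, [2])
```

Keeping the extra argument in the shared interface means training and prediction code does not have to change when a decoder that needs the valid lengths is swapped in.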

captainst · October 27, 2021, 4:04pm

I think that @sanjaradylov and @shaoming_xu are referring to how the context variable “c” is implemented in the predict_seq2seq() function.
According to Fig. 9.7.1, the context variable “c” is the last hidden state of the encoder, which should remain unchanged across the decoder’s time steps. However, the current implementation of predict_seq2seq() feeds dec_state back in recursively:

Y, dec_state = net.decoder(dec_X, dec_state)

This is incorrect, at least according to the design.
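
A toy scalar example may help show what changes when the context is re-read from the evolving decoder state instead of staying fixed. Everything here is hypothetical: decoder_step is a made-up one-unit cell, not the book's GRU decoder.

```python
import math

def decoder_step(x, context, state):
    """Hypothetical one-step RNN cell on scalars; the input is (x, context)."""
    return math.tanh(0.5 * x + 0.5 * context + 0.5 * state)

def decode_fixed_context(xs, enc_state):
    # The context equals the encoder's final state at every step.
    state, states = enc_state, []
    for x in xs:
        state = decoder_step(x, enc_state, state)
        states.append(state)
    return states

def decode_drifting_context(xs, enc_state):
    # The context is re-read from the latest decoder state at every step.
    state, states = enc_state, []
    for x in xs:
        state = decoder_step(x, state, state)
        states.append(state)
    return states

xs = [1.0, -1.0, 0.5]
fixed = decode_fixed_context(xs, enc_state=0.8)
drifting = decode_drifting_context(xs, enc_state=0.8)
```

The two loops agree on the first step, where the decoder state still equals the encoder state, and diverge from the second step on.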

cgr71 · December 27, 2021, 10:08pm

Hi! Has anyone been able to get good results with this encoder-decoder? I’ve implemented the code above, tried to improve it (e.g. an lr scheduler, different approaches such as removing the concatenated context from the decoder inputs), tried different parameters, etc., and it’s still not getting good results… It overfits with 1000, 10000 and 100000 sentences… I have tried a smaller model but it still overfits. I’m trying to translate from Spanish to English with the Tatoeba corpus from OPUS.