In predict_seq2seq(), the decoding loop is:

    for _ in range(num_steps):
        Y, dec_state = net.decoder(dec_X, dec_state)
Here dec_state is returned by net.decoder at each step and fed back in at the next step, i.e., it is updated recurrently.
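If I read the book's Seq2SeqDecoder correctly, its forward looks roughly like this (paraphrased, so details may differ), which shows the two roles that dec_state plays:

    def forward(self, X, state):
        X = self.embedding(X).permute(1, 0, 2)
        # Role (a): the top layer of `state` is the context that gets
        # concatenated to the input at every time step.
        context = state[-1].repeat(X.shape[0], 1, 1)
        X_and_context = torch.cat((X, context), 2)
        # Role (b): `state` also initializes the GRU's hidden state,
        # and the GRU returns the updated hidden state.
        output, state = self.rnn(X_and_context, state)
        output = self.dense(output).permute(1, 0, 2)
        return output, state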
I feel this doesn't match Fig. 9.7.3, where the decoder input at every time step is concatenated with the same context variable, namely the encoder's final hidden state. In other words, dec_state should always be kept the same, as the code below does:
    Y, _ = net.decoder(dec_X, dec_state)
But this change creates a problem: inside net.decoder, dec_state is also used to initialize the RNN hidden state for the next time step, so it cannot simply be kept frozen. Therefore, within the current framework, maybe the original code is the best solution, unless the class Seq2SeqDecoder(d2l.Decoder) is adjusted so that the context and the hidden state are kept separate (see the sketch below).
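For illustration, here is a hypothetical sketch of such an adjustment (my own; the class name Seq2SeqDecoderFixedContext and the tuple-valued state are assumptions, not the book's code). It separates the two roles: the state is a pair (context, hidden), where context is a frozen copy of the encoder's final top-layer state used only for concatenation, and hidden is the RNN state that advances each step.

    import torch
    from torch import nn
    from d2l import torch as d2l

    class Seq2SeqDecoderFixedContext(d2l.Decoder):
        """Hypothetical variant: the concatenated context stays fixed."""
        def __init__(self, vocab_size, embed_size, num_hiddens, num_layers,
                     dropout=0, **kwargs):
            super().__init__(**kwargs)
            self.embedding = nn.Embedding(vocab_size, embed_size)
            self.rnn = nn.GRU(embed_size + num_hiddens, num_hiddens,
                              num_layers, dropout=dropout)
            self.dense = nn.Linear(num_hiddens, vocab_size)

        def init_state(self, enc_outputs, *args):
            enc_state = enc_outputs[1]
            # Freeze the encoder's final top-layer state as the context;
            # the full enc_state starts the decoder's hidden state.
            return (enc_state[-1], enc_state)

        def forward(self, X, state):
            context, hidden = state
            X = self.embedding(X).permute(1, 0, 2)
            # Concatenate the *fixed* context at every step, as in Fig. 9.7.3.
            X_and_context = torch.cat(
                (X, context.repeat(X.shape[0], 1, 1)), 2)
            output, hidden = self.rnn(X_and_context, hidden)
            output = self.dense(output).permute(1, 0, 2)
            # Only the hidden state advances; the context never changes.
            return output, (context, hidden)

With this variant the original loop, Y, dec_state = net.decoder(dec_X, dec_state), would already match Fig. 9.7.3, since the context component of dec_state is returned unchanged at every step.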
Please point out whether my understanding is correct.
Thanks!