And about the answer to exercise 1,
1. What can be the volitional cue when decoding a sequence token by token in machine translation? What are the nonvolitional cues and the sensory inputs?
I think the sensory inputs are the embeddings of the tokens, but what is the volitional cue?
The nonvolitional cues are the source input sequence.
The volitional cues are the predicted/translated output sequence at each time step.
The sensory inputs are the target language dictionary.
import torch

attention_weights = torch.rand(10, 10)  # generate a 10x10 random matrix
m = torch.nn.Softmax(dim=1)  # softmax over the keys, so each row (query) sums to 1
out = m(attention_weights)  # apply softmax
# the following two lines are simply borrowed from the example
attention_weights = out.reshape((1, 1, 10, 10))
show_heatmaps(attention_weights, xlabel='Keys', ylabel='Queries')
I think the volitional cue here is the decoder state at each time step, and the sensory inputs are the hidden states from each encoding time step. You bias the selection by taking a weighted average of the values (the hidden states from each encoding time step) for a given volitional cue (i.e. the hidden state of the decoder at a given time step).
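To make this reading concrete, here is a minimal sketch in plain Python (no framework, so the arithmetic stays visible). It assumes dot-product scoring, which is only one possible choice, and the function names `softmax` and `attend` as well as all the numbers are hypothetical toy values, not something from the book's code. The decoder state plays the query, and the encoder hidden states serve as both keys and values.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Dot-product attention for one query (assumed scoring function)."""
    # One score per key: how well does this key match the query?
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    # Weighted average of the values, one output entry per value dimension
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, context


# Hypothetical toy data: three encoder hidden states, one per source token
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical decoder hidden state at the current decoding step (the query)
decoder_state = [2.0, 0.0]

weights, context = attend(decoder_state, encoder_states, encoder_states)
print("attention weights:", weights)
print("context vector:", context)
```

The encoder states that align best with the decoder state get the largest weights, so the context vector is pulled toward them. That is the "bias" the volitional cue applies to the selection.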
From my point of view, the encoder hidden state could be the volitional cue (the queries), and the hidden states of the decoder are the nonvolitional cues. I think the encoder hidden state stores the information of the original input text, so this information should be able to bias the hidden state of the decoder. As a result, the decoder can focus on hidden states that make sense according to the encoder hidden states.