Attention Cues

https://d2l.ai/chapter_attention-mechanisms-and-transformers/attention-cues.html

Great book!
About the answer to exercise 1:
1. What can be the volitional cue when decoding a sequence token by token in machine translation? What are the nonvolitional cues and the sensory inputs?
I think the sensory inputs are the embeddings of the tokens, but what is the volitional cue?

The nonvolitional cues are the source input sequence.
The volitional cues are the predicted/translated output sequence at each time step.
The sensory inputs are the target language dictionary.

import torch
from d2l import torch as d2l  # provides show_heatmaps, as defined in the book

attention_weights = torch.rand(10, 10)  # generate a 10x10 random matrix
m = torch.nn.Softmax(dim=1)  # softmax over dim 1, so each row (query) sums to 1
out = m(attention_weights)  # apply softmax

# the following two lines are simply borrowed from the example
attention_weights = out.reshape((1, 1, 10, 10))
d2l.show_heatmaps(attention_weights, xlabel='Keys', ylabel='Queries')

I think the volitional cue here is the decoder state at each time step, and the sensory inputs are the hidden states from each encoding time step. For a given volitional cue (i.e. the hidden state of the decoder at a given time step), you bias the selection by taking a weighted average of the values (the hidden states from each encoding time step).
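
To make the weighted-average idea concrete, here is a minimal sketch, not the book's implementation. The sizes, the random tensors, and the scaled dot-product scoring are all assumptions chosen for illustration; the section itself does not fix a scoring function.

```python
import torch

torch.manual_seed(0)  # reproducible toy numbers

num_steps, hidden_size = 4, 8  # hypothetical sizes, for illustration only

# Sensory inputs: one hidden state per source (encoder) time step.
# In this sketch they serve as both keys and values.
encoder_states = torch.randn(num_steps, hidden_size)

# Volitional cue (query): the decoder hidden state at the current step.
decoder_state = torch.randn(hidden_size)

# Score each encoder state against the query. Scaled dot product is an
# assumption here; other scoring functions would work as well.
scores = encoder_states @ decoder_state / hidden_size ** 0.5

# Normalize the scores into weights that sum to 1.
weights = torch.softmax(scores, dim=0)

# Context: the weighted average of the values.
context = weights @ encoder_states

print(weights)        # one weight per encoder time step
print(context.shape)  # torch.Size([8])
```

A different decoder state would yield different weights over the same encoder states, which is the sense in which the query biases the selection.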

In my view, the encoder hidden states could be the volitional cues (queries), and the hidden states of the decoder the nonvolitional cues. The encoder hidden states store the information of the original input text, so this information should be able to bias the hidden state of the decoder. As a result, the decoder can focus on the hidden states that make sense according to the encoder hidden states.

Question 1.
What can be the volitional cue when decoding a sequence token by token in machine translation? What are the nonvolitional cues and the sensory inputs?

The volitional cues (queries) are the words to be translated (e.g. “egg”).

The nonvolitional cues (keys) are the training input words paired with the output words (e.g. “egg” in the egg-huevo pair, where huevo is the Spanish word for egg).

The sensory inputs (values) are the training output words paired with the input words (e.g. “huevo” in the egg-huevo pair).