Multi-Head Attention


I ran an experiment:
In multi_attention, the output has shape [batch_size, num_heads, queries, num_hiddens/num_heads].
I want to apply a softmax to this output to check the weights of the output.
The slice for the first head is [batch_size, 0, queries, num_hiddens/num_heads]. Repeating the last dim num_heads times gives a tensor of shape [batch_size, num_heads, queries, num_hiddens]. Then I use the softmax weights to measure the weight of each head.
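
A minimal NumPy sketch of one way to read the steps above. Several things here are assumptions, not taken from the post: the concrete sizes, the random stand-in for the attention output, the helper name `softmax`, and above all the choice to take the softmax over the head axis (axis 1), which seems to match the goal of comparing heads but is not stated explicitly.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis (hypothetical helper).
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Assumed sizes, chosen only for illustration.
batch_size, num_heads, queries, num_hiddens = 2, 4, 3, 8
head_dim = num_hiddens // num_heads

# Random stand-in for the per-head attention output:
# shape [batch_size, num_heads, queries, num_hiddens/num_heads].
rng = np.random.default_rng(0)
output = rng.normal(size=(batch_size, num_heads, queries, head_dim))

# Slice of the first head: [batch_size, queries, num_hiddens/num_heads].
first_head = output[:, 0]

# Repeat the last dim num_heads times for every head:
# [batch_size, num_heads, queries, num_hiddens].
repeated = np.tile(output, (1, 1, 1, num_heads))

# Assumption: softmax across the head axis, so that for each
# (batch, query, feature) position the head weights sum to 1.
head_weights = softmax(repeated, axis=1)

print(first_head.shape, repeated.shape, head_weights.shape)
```

With the softmax taken over axis 1, a larger weight for a head at some position only says that head's activation is larger there relative to the other heads. Whether that is a meaningful measure of head importance is a separate question.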