It seems that self-attention cannot simply be run through the multi-head attention written in the previous section; the way it would be computed looks problematic to me.
If multi-head attention is used to compute self-attention, then after one full vector is split into heads, each head vector ends up being computed against the same-index head vectors of the other queries in the same batch:
bmm((batch*num_head, num_query, query_dim/num_head), (batch*num_head, query_dim/num_head, num_query)) → (batch*num_head, num_query, num_query), whereas it should be (batch*num_query, num_head, num_head).
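To make the shapes concrete, here is a small check. The head-splitting rearrangement below is only my assumption of how the previous section's multi-head attention prepares its inputs; the actual helper there may differ.

import torch

# Hypothetical sizes, just to illustrate the shapes.
batch, num_query, num_head, query_dim = 2, 5, 4, 16
x = torch.randn(batch, num_query, query_dim)

# Assumed head split: (batch, num_query, query_dim)
#   -> (batch*num_head, num_query, query_dim/num_head)
q = x.reshape(batch, num_query, num_head, query_dim // num_head)
q = q.permute(0, 2, 1, 3).reshape(batch * num_head, num_query, query_dim // num_head)
k = q  # in self-attention the keys come from the same input

scores = torch.bmm(q, k.transpose(1, 2))
print(scores.shape)  # torch.Size([8, 5, 5]) = (batch*num_head, num_query, num_query)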
So perhaps the code should be corrected slightly; that might make the logic more sound:
import torch
from torch import nn


class Self_attention(nn.Module):
    def __init__(self,
                 input_size,
                 num_hiddens,
                 value_size,
                 dropout: float = 0.,
                 ):
        super(Self_attention, self).__init__()
        # One shared input x is projected into queries, keys and values.
        self.w_q = nn.Linear(input_size, num_hiddens, bias=False)
        self.w_k = nn.Linear(input_size, num_hiddens, bias=False)
        self.w_v = nn.Linear(input_size, value_size, bias=False)
        # Scaled dot-product attention from the previous section.
        self.attention = Dot_attention(dropout=dropout)

    def forward(self, x: torch.Tensor):
        '''
        :param x: x.shape = (batch, num_vector, vector_dim)
        :return: return.shape = (batch, num_vector, value_dim)
        '''
        queries, keys, values = self.w_q(x), self.w_k(x), self.w_v(x)
        return self.attention(queries, keys, values)
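The class relies on the Dot_attention from the previous section. For completeness, below is a minimal scaled dot-product attention sketch of what I assume that class looks like (the real implementation there may differ), followed by a quick shape check of Self_attention with made-up sizes.

import math
import torch
from torch import nn


class Dot_attention(nn.Module):
    '''Minimal scaled dot-product attention sketch (my assumption of the
    previous section's Dot_attention; the exact code there may differ).'''
    def __init__(self, dropout: float = 0.):
        super(Dot_attention, self).__init__()
        self.dropout = nn.Dropout(dropout)

    def forward(self, queries, keys, values):
        # queries: (batch, num_query, d), keys: (batch, num_kv, d),
        # values:  (batch, num_kv, value_dim)
        d = queries.shape[-1]
        scores = torch.bmm(queries, keys.transpose(1, 2)) / math.sqrt(d)
        attention_weights = torch.softmax(scores, dim=-1)
        return torch.bmm(self.dropout(attention_weights), values)


# Quick shape check of Self_attention (hypothetical sizes):
x = torch.randn(2, 5, 16)  # (batch, num_vector, vector_dim)
self_attn = Self_attention(input_size=16, num_hiddens=32, value_size=24)
print(self_attn(x).shape)  # torch.Size([2, 5, 24])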
I hope you can help me review this line of reasoning; if there are any mistakes, please feel free to point them out.