Attention Pooling: Nadaraya-Watson Kernel Regression


  • In the parametric attention pooling, any training input takes key-value pairs from all the training examples except for itself to predict its output.-

I don’t know why can’t we use itself to predict its output. Is it because that the output of the softmax is too large, i.e. the weight of its value is large, so that the rest of the training set are relatively useless?

My take: the parametric version is an optimization problem to find the correct function while the non-parametric version has a function already.
Since it’s an optimization problem, the loss we optimize is based on the training examples (the error in the estimation over all the training examples). We could achieve zero loss if the gaussian kernel’s width w becomes tiny i.e. it only weights the training output at the current input and disregards everything else. There is nothing wrong inherently with that but the resulting estimator is literally just a set of points (it estimates exactly the y_i at a particular training x_i) with a width w -> 0. This leads to a straight line i.e. the “dumb” averaging estimator.


1 Like

I am not sure I understand correctly, but wouldn’t in the case of x=x_i result in


In which case the respective i_th prod would be zero?

Because x itself (key) does not have a label y (value): we are predicting its y

Inverse Kernel (Lenrek?)

Question 4 lead me onto a kernel potentially useful for the inverse purpose of most attention kernels.


self.attention_weights = (
            /torch.sum(torch.abs((queries - keys)*self.w), dim=(0))


Screen Shot 2021-03-25 at 12.16.56 PM

Screen Shot 2021-03-25 at 12.16.49 PM


Does the opposite of what attention weights normally do:
Shifts focus AWAY from the closest neighbors.
Potentially useful for components focusing on long-range dependencies.

If you find it useful, please call this kerneL ‘Lenrek’.

1 Like