https://d2l.ai/chapter_attention-mechanisms/nadaraya-watson.html

Hi,

"In the parametric attention pooling, any training input takes key-value pairs from all the training examples except for itself to predict its output."

I don't know why we can't use the input itself to predict its output. Is it because its softmax weight would be too large, so that the rest of the training set becomes relatively useless?

My take: the parametric version is an optimization problem to find the correct function while the non-parametric version has a function already.

Since it's an optimization problem, the loss we optimize is based on the training examples (the error in the estimation over all the training examples). If each example could use itself, we could achieve zero loss by making the Gaussian kernel's width tiny, i.e. it only weights the training output at the current input and disregards everything else. There is nothing inherently wrong with that, but the resulting estimator is literally just a set of points (it reproduces exactly the y_i at each training x_i). Note that in the chapter's kernel w multiplies the distance, so a tiny width corresponds to a large |w|. The opposite extreme, w -> 0, gives uniform weights and hence a straight line, i.e. the "dumb" averaging estimator.
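To make this concrete, here is a minimal sketch in plain Python (not the chapter's torch code). The data, the fixed value `w = 20.0` and the helper names `nw_predict` and `training_loss` are my own assumptions for illustration. It compares the training loss when each example may use itself as a key against the leave-one-out setup.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def nw_predict(query, keys, values, w):
    # Gaussian-kernel attention pooling: softmax(-((query - key) * w)^2 / 2).
    scores = [-((query - k) * w) ** 2 / 2 for k in keys]
    weights = softmax(scores)
    return sum(a * v for a, v in zip(weights, values))

# Made-up training data (assumption, for illustration only).
x_train = [0.0, 1.0, 2.0, 3.0]
y_train = [0.3, 2.1, 1.2, 3.7]

def training_loss(w, exclude_self):
    loss = 0.0
    for i, (x, y) in enumerate(zip(x_train, y_train)):
        if exclude_self:
            # Leave-one-out: drop the i-th key-value pair.
            keys = x_train[:i] + x_train[i + 1:]
            values = y_train[:i] + y_train[i + 1:]
        else:
            keys, values = x_train, y_train
        loss += (nw_predict(x, keys, values, w) - y) ** 2 / 2
    return loss

# A large w means a very narrow kernel.
loss_with_self = training_loss(w=20.0, exclude_self=False)
loss_without_self = training_loss(w=20.0, exclude_self=True)
print(loss_with_self, loss_without_self)
```

With the self pair included, a narrow kernel just copies y_i back and the loss collapses towards zero. With the self pair excluded, the same w gives a clearly non-zero loss, so the optimizer cannot cheat this way.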

I am not sure I understand correctly, but wouldn't the case x = x_i result in

softmax(-1/2((x_i-x_i)*w)^2)

in which case the respective i-th product would be zero?
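A quick numeric check may help with this question. It is a sketch with toy keys of my own choosing, and `gaussian_attention_weights` is a hypothetical helper, not the chapter's code. The score inside the softmax is indeed zero when x = x_i, but softmax exponentiates the score, and exp(0) = 1 is the largest value any term can take here, so the self pair gets the largest weight rather than a zero weight.

```python
import math

def gaussian_attention_weights(query, keys, w):
    # Scores are -1/2 * ((query - key) * w)^2; softmax turns them into weights.
    scores = [-0.5 * ((query - k) * w) ** 2 for k in keys]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

keys = [0.0, 1.0, 2.0]  # toy keys (assumption)
weights = gaussian_attention_weights(query=1.0, keys=keys, w=1.0)
self_weight = weights[1]  # weight on the key equal to the query

print(weights)
```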

Because x itself (key) does not have a label y (value): we are predicting its y

# Inverse Kernel (Lenrek?)

Question 4 led me to a kernel that is potentially useful for the inverse purpose of most attention kernels.

## Kernel:

```
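# Normalized absolute distance: a larger |query - key| gives a larger weight.
# The sum is taken over dim 0, as in the original post.
scaled_dist = torch.abs((queries - keys) * self.w)
self.attention_weights = scaled_dist / torch.sum(scaled_dist, dim=0)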
```

## Effect:

## Description:

Does the opposite of what attention weights normally do:

Shifts focus AWAY from the closest neighbors.

Potentially useful for components focusing on long-range dependencies.
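Here is a small standalone sketch of the idea in plain Python, so it runs without torch. It normalizes over the keys of a single query, which is my assumption about the intent (the snippet above sums over `dim=0`). The function name `inverse_kernel_weights` and the toy keys are hypothetical.

```python
def inverse_kernel_weights(query, keys, w=1.0):
    # |(query - key) * w| normalized over the keys of one query.
    dists = [abs((query - k) * w) for k in keys]
    total = sum(dists)
    if total == 0.0:
        # Every key coincides with the query: fall back to uniform weights.
        return [1.0 / len(keys)] * len(keys)
    return [d / total for d in dists]

keys = [0.0, 1.0, 2.0, 5.0]  # toy keys (assumption)
weights = inverse_kernel_weights(query=1.0, keys=keys)
weights_scaled = inverse_kernel_weights(query=1.0, keys=keys, w=3.0)

print(weights)
print(weights_scaled)
```

The key equal to the query gets weight zero and the farthest key gets the largest weight. One thing this sketch suggests: a non-zero scalar w appears in both the numerator and the denominator, so it cancels out and the weights do not depend on it.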

If you find it useful, please call this kerneL "Lenrek".