The Transformer Architecture

Why are we using a fixed positional encoding here? Sure, I can see how this encoding works, and that the relationships between positions are picked up in later layers of the network, but can’t we remove the need for this layer by learning the positional encoding? For instance, we could train a d x 1 convolutional filter as the positional encoding.

In fact, why do we add the positional encoding to the original data? Wouldn’t we lose some information in the process of adding in the encoding? What’s wrong with simply appending the position onto each data point?

I found the answer to my question here

You can learn the positional embedding, and there are several ways to do it. However, the difference between learned embeddings and well-designed, non-learned manual embeddings is small.

Concatenating the position is a good idea in most cases, as it ensures orthogonality. However, the dimensionality of most embeddings is so high that we get approximate orthogonality anyway. It is therefore better to add, as we have fewer parameters to solve for.
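A small numerical sketch of the near-orthogonality argument above, assuming NumPy and using random Gaussian vectors as a stand-in for real embeddings. The dimensions, the seed, and the helper name mean_abs_cosine are arbitrary choices for illustration, not something from the original posts:

```python
import numpy as np

def mean_abs_cosine(dim, num_pairs=1000, seed=0):
    """Average |cosine similarity| between pairs of random Gaussian vectors."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((num_pairs, dim))
    b = rng.standard_normal((num_pairs, dim))
    cos = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return float(np.mean(np.abs(cos)))

low = mean_abs_cosine(dim=4)     # low-dimensional vectors overlap noticeably
high = mean_abs_cosine(dim=512)  # high-dimensional vectors are nearly orthogonal

# Adding keeps the feature width; concatenating widens the input
# that every later layer has to handle.
seq_len, d_model, d_pos = 10, 512, 512  # d_pos is a hypothetical position width
x = np.zeros((seq_len, d_model))
p = np.zeros((seq_len, d_pos))
added_shape = (x + p).shape
concat_shape = np.concatenate([x, p], axis=1).shape
print(low, high, added_shape, concat_shape)
```

The average overlap shrinks as the dimension grows, which is the sense in which adding loses little compared with concatenating.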

Hi @Van_Tran, great questions and answers! The fundamental intuition behind positional encoding is to keep some level of sequential information; it doesn’t need to be 100% precise.

In the multi-head attention layer, the transpose_qkv function splits the embedding vector and moves ‘num_heads’ into the ‘batch_size’ dimension:

(batch_size, seq_len, num_hiddens) to
(batch_size * num_heads, seq_len, num_hiddens / num_heads)

batch_size = 2, seq_len = 3, num_hiddens = 4
num_heads = 2

X = np.arange(24)
X = X.reshape(2,3,4)
X = transpose_qkv(X, num_heads)
(2,3,4) -> (4,3,2)

[[[ 0 1 2 3][ 4 5 6 7][ 8 9 10 11]]
[[12 13 14 15][16 17 18 19][20 21 22 23]]]

After transpose_qkv, the batch order of the output elements is
(1 2) -> (1 1 2 2)

[[[ 0 1][ 4 5][ 8 9]]
[[ 2 3][ 6 7][10 11]]
[[12 13][16 17][20 21]]
[[14 15][18 19][22 23]]]

However, the valid_len vector has produced a different element order
valid_len = np.array([1,2])
valid_len = np.tile(valid_len, num_heads)
(1 2) -> (1 2 1 2)

token order = (1 1 2 2)
valid_len order = (1 2 1 2)

The following code fixes this bug:
valid_len = np.tile(valid_len, self.num_heads)
valid_len = valid_len.reshape(self.num_heads, -1)
valid_len = valid_len.transpose(1,0)
valid_len = valid_len.reshape(-1)

valid_len order = (1 1 2 2)
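The ordering mismatch can be checked directly. The sketch below assumes NumPy and assumes that transpose_qkv is the reshape/transpose/reshape sequence that produces the output listed above; np.repeat is shown only as an equivalent one-liner for this 1-D case:

```python
import numpy as np

def transpose_qkv(X, num_heads):
    # (batch_size, seq_len, num_hiddens) ->
    # (batch_size, seq_len, num_heads, num_hiddens / num_heads)
    X = X.reshape(X.shape[0], X.shape[1], num_heads, -1)
    # -> (batch_size, num_heads, seq_len, num_hiddens / num_heads)
    X = X.transpose(0, 2, 1, 3)
    # -> (batch_size * num_heads, seq_len, num_hiddens / num_heads)
    return X.reshape(-1, X.shape[2], X.shape[3])

num_heads = 2
X = np.arange(24).reshape(2, 3, 4)
out = transpose_qkv(X, num_heads)

# Values 0..11 belong to batch item 0 and 12..23 to batch item 1, so integer
# division by 12 tells us which batch item each output row came from.
batch_of_row = out[:, 0, 0] // 12

valid_len = np.array([1, 2])
tiled = np.tile(valid_len, num_heads)  # order produced by np.tile
fixed = (np.tile(valid_len, num_heads)
         .reshape(num_heads, -1).transpose(1, 0).reshape(-1))
repeated = np.repeat(valid_len, num_heads)  # same order as `fixed`
print(batch_of_row, tiled, fixed, repeated)
```

Only the orders in which each batch item's entries are adjacent line up with the rows of the transposed tensor.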

I tested the fix on the SQuAD problem.
With the original code, training does not converge. With the changes, it does.

Appreciate any help to verify. Many thanks!

@goldpiggy hi there, I have the same question about the Multi-Head Attention part.

To my understanding,
suppose we have a batch of n examples and m heads.

X has dimension: (batch_size * num_heads, seq_len, num_hiddens / num_heads)

and should be seen as

batch 1, head 1, …
batch 1, head 2, …

batch 1, head m, …

batch n, head m, …

the valid_len is in dimension (batch_size)

batch 1, head 1
batch 2, head 1

batch n, head 1

with np.tile we extend it to (batch_size * num_heads)

batch 1, head 1
batch 2, head 1

batch n, head 1

batch n, head m

In self.attention, as implemented in 10.1 Attention Mechanisms - Dot Product Attention,
X and valid_len are consumed in the same order,
so we get

X | valid_len

batch 1, head 1, … --> batch 1, head 1
batch 1, head 2, … --> batch 2, head 1 (starting to get things wrong)

batch 1, head m, … --> batch n, head m (wrong)

batch n, head m, …

If this makes sense, transpose_qkv should be changed to

def transpose_qkv(X, num_heads):
    # Input X shape: (batch_size, seq_len, num_hiddens).
    # Output X shape:
    # (batch_size, seq_len, num_heads, num_hiddens / num_heads)
    X = X.reshape(X.shape[0], X.shape[1], num_heads, -1)
    # X shape:
    # (num_heads, batch_size, seq_len, num_hiddens / num_heads)
    X = X.transpose(2, 0, 1, 3)
    # output shape:
    # (num_heads*batch_size , seq_len, num_hiddens / num_heads)
    output = X.reshape(-1, X.shape[2], X.shape[3])
    return output

The corresponding transpose_output function would need the same change.

Hoping to get your reply soon.
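For reference, a sketch of what the matching inverse could look like under the head-major layout proposed in this post. It assumes NumPy; the names transpose_qkv_head_major and transpose_output_head_major are hypothetical and are not the book's functions:

```python
import numpy as np

def transpose_qkv_head_major(X, num_heads):
    # (batch_size, seq_len, num_hiddens) ->
    # (batch_size, seq_len, num_heads, num_hiddens / num_heads)
    X = X.reshape(X.shape[0], X.shape[1], num_heads, -1)
    # -> (num_heads, batch_size, seq_len, num_hiddens / num_heads)
    X = X.transpose(2, 0, 1, 3)
    # -> (num_heads * batch_size, seq_len, num_hiddens / num_heads)
    return X.reshape(-1, X.shape[2], X.shape[3])

def transpose_output_head_major(X, num_heads):
    # (num_heads * batch_size, seq_len, d) -> (num_heads, batch_size, seq_len, d)
    X = X.reshape(num_heads, -1, X.shape[1], X.shape[2])
    # -> (batch_size, seq_len, num_heads, d)
    X = X.transpose(1, 2, 0, 3)
    # -> (batch_size, seq_len, num_hiddens)
    return X.reshape(X.shape[0], X.shape[1], -1)

num_heads = 2
X = np.arange(24).reshape(2, 3, 4)
split = transpose_qkv_head_major(X, num_heads)
restored = transpose_output_head_major(split, num_heads)

# Values 0..11 come from batch item 0, so // 12 gives the batch item per row.
batch_of_row = split[:, 0, 0] // 12
tiled = np.tile(np.array([1, 2]), num_heads)
print(batch_of_row, tiled, np.array_equal(restored, X))
```

With this layout the rows cycle through the batch items within each head, which is the order np.tile produces, and the round trip recovers the original tensor.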


Thanks @karloon and @yaohwang! We should use repeat(axis=0) instead of tile (now fixed). The former does the following: on axis 0, copy the first item (scalar or vector) N times, then copy the next item, and so on.

To explain, in our d2l.transpose_output function, we have:

X = X.reshape(-1, num_heads, X.shape[1], X.shape[2]),

which means that our head-wise copy should align with shape:

(batch_size, num_heads, num_steps, num_hiddens / num_heads).

Suppose that num_heads=4, valid_lens is [[1,2,3],[4,5,6]] (which means batch_size=2, num_steps=3)

valid_lens = valid_lens.repeat(self.num_heads, axis=0)

gives [[1,2,3],[1,2,3],[1,2,3],[1,2,3],[4,5,6],[4,5,6],[4,5,6],[4,5,6]], whose reshaping to

[[[1,2,3],[1,2,3],[1,2,3],[1,2,3]], [[4,5,6],[4,5,6],[4,5,6],[4,5,6]]] shows that our valid lengths are copied along the head axis: this is what we want.
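The same 2-D example can be reproduced with plain NumPy as a quick check; this is a sketch, and the np.tile line is included only to contrast the two orderings:

```python
import numpy as np

num_heads = 4
valid_lens = np.array([[1, 2, 3], [4, 5, 6]])  # batch_size=2, num_steps=3

# repeat copies each row num_heads times before moving on to the next row.
repeated = valid_lens.repeat(num_heads, axis=0)  # shape (8, 3)

# Viewing it as (batch_size, num_heads, num_steps) exposes the head axis.
per_head = repeated.reshape(-1, num_heads, valid_lens.shape[1])

# tile instead stacks whole copies of the batch, interleaving batch items.
tiled = np.tile(valid_lens, (num_heads, 1))  # shape (8, 3)
print(per_head)
print(tiled)
```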

Thanks for your reply @astonzhang.

That’s a better choice than fixing the transpose_qkv function.

Maybe using a different value for num_heads (one that differs from batch_size=2 and num_steps=3) would make the example easier to understand.

Good idea! Edited my post.

Thanks @astonzhang!

Yes, this is cleaner.

num_heads = 2
valid_lens = np.array([1,2])
valid_lens = valid_lens.repeat(num_heads, axis=0)

(1 2) -> (1 1 2 2)

token order = (1 1 2 2)
valid_len order = (1 1 2 2)

In the transformer, the input (source) and output (target) sequence embeddings are added with positional encoding before being fed into the encoder and the decoder that stack modules based on self-attention. What is the benefit of the sequence embeddings being added with positional encoding before being fed into the encoder and the decoder? What about the case where the sequence embeddings are being added with positional encoding after being fed into the encoder and the decoder?

10.7.4. Encoder
Since we use the fixed positional encoding whose values are always between -1 and 1, we multiply values of the learnable input embeddings by the square root of the embedding dimension to rescale before summing up the input embedding and the positional encoding.

Why is the embedding multiplied by the square root of the embedding dimension? Why rescale the values at all? And why is the ratio the square root of the embedding dimension?


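I have the same doubt. Here is the official PyTorch example:

import torch
from torch import nn

embedding = nn.Embedding(10, 3)
input = torch.LongTensor([[1,2,4,5],[4,3,2,9]])
embedding(input)
# output (values depend on the random initialization):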
tensor([[[-0.0251, -1.6902,  0.7172],
         [-0.6431,  0.0748,  0.6969],
         [ 1.4970,  1.3448, -0.9685],
         [-0.3677, -2.7265, -0.1685]],
        [[ 1.4970,  1.3448, -0.9685],
         [ 0.4362, -0.4004,  0.9400],
         [-0.6431,  0.0748,  0.6969],
         [ 0.9124, -2.3616,  1.1151]]])

The proposed scaling multiplies each element above by sqrt(3)≈1.732. This scales each element up a bit before the positional encoding, whose values lie between -1 and +1, is added.
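The magnitude argument can be sketched numerically. The snippet below assumes NumPy, a sinusoidal encoding, and, purely for illustration, embeddings drawn with standard deviation 1/sqrt(d). That initialization is an assumption, not something stated in this thread (nn.Embedding, for instance, draws from N(0, 1) by default, so its values are already on the order of 1):

```python
import numpy as np

def positional_encoding(num_steps, num_hiddens):
    """Fixed sinusoidal encoding; assumes num_hiddens is even."""
    pos = np.arange(num_steps).reshape(-1, 1)
    freq = np.power(10000, np.arange(0, num_hiddens, 2) / num_hiddens)
    P = np.zeros((num_steps, num_hiddens))
    P[:, 0::2] = np.sin(pos / freq)
    P[:, 1::2] = np.cos(pos / freq)
    return P

def rms(a):
    """Root-mean-square magnitude of all entries."""
    return float(np.sqrt(np.mean(a ** 2)))

d, num_steps = 512, 100
rng = np.random.default_rng(0)
# Assumed initialization for illustration: std = 1 / sqrt(d).
emb = rng.standard_normal((num_steps, d)) / np.sqrt(d)
P = positional_encoding(num_steps, d)

rms_emb = rms(emb)                   # unscaled embeddings
rms_scaled = rms(emb * np.sqrt(d))   # after multiplying by sqrt(d)
rms_pos = rms(P)                     # positional encoding
print(rms_emb, rms_scaled, rms_pos)
```

Under that assumption the unscaled embeddings are much smaller than the positional encoding, and multiplying by sqrt(d) brings the two to a comparable scale so that the position signal does not swamp the token signal.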

When predicting, target tokens are fed into the decoder one by one. At this point num_steps is effectively 1, that is, each input is a tensor of shape torch.Size([1, 1, num_hiddens]). As a result, every token receives the same positional encoding, i.e., the positional encoding fails at prediction time.
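The concern can be illustrated with a small NumPy sketch. The offset parameter below is hypothetical: it is one possible way to tell the encoding which step is being decoded, not the book's implementation:

```python
import numpy as np

def positional_encoding(num_steps, num_hiddens, offset=0):
    """Sinusoidal encoding for positions offset .. offset + num_steps - 1.

    `offset` is a hypothetical parameter; assumes num_hiddens is even.
    """
    pos = (offset + np.arange(num_steps)).reshape(-1, 1)
    freq = np.power(10000, np.arange(0, num_hiddens, 2) / num_hiddens)
    P = np.zeros((num_steps, num_hiddens))
    P[:, 0::2] = np.sin(pos / freq)
    P[:, 1::2] = np.cos(pos / freq)
    return P

num_hiddens, total_steps = 8, 5
full = positional_encoding(total_steps, num_hiddens)

# Decoding one token at a time without an offset: every step gets row 0.
naive = [positional_encoding(1, num_hiddens)[0] for _ in range(total_steps)]

# Passing the current step as the offset recovers the per-position rows.
with_offset = [positional_encoding(1, num_hiddens, offset=t)[0]
               for t in range(total_steps)]
print(np.allclose(np.stack(with_offset), full))
```

If the decoder tracks how many tokens it has already produced, that count can serve as the offset.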