After reading the Transformer paper, the motivation for applying a stack of multi-head attention layers is clear.
However, what is the reason behind the idea of using a position-wise feed-forward layer in the Transformer?
It is quite unclear to me. Do you know the answer?
@congvmit Not sure we have an exact explanation of the role of position-wise FFNs. Still, note that we apply a nonlinearity to the pooled embeddings, most probably to increase the expressive power of the model (roughly the same motivation as when we engineer tabular features and then pass them to a 2-3 layer FFN to predict something). Also, remember that each FFN block has a residual connection, i.e., it learns a residual mapping prior to Add&Norm. So it seems one could alternatively try stacking several attention blocks without FFNs first, then regular attention-FFN blocks toward the final layer (e.g., attn->attn->attn->[attn-ffn]->[attn-ffn]->[attn-ffn]->ffn->prediction).
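For what it's worth, here is a minimal sketch (assuming PyTorch) of the position-wise FFN block with the residual connection and Add&Norm mentioned above. The class name `PositionWiseFFN` and the default sizes are just illustrative; the dimensions follow the original paper, and the point is that the same two-layer MLP is applied independently at every position:

```python
import torch
import torch.nn as nn


class PositionWiseFFN(nn.Module):
    """Illustrative position-wise FFN sub-layer with residual + Add&Norm."""

    def __init__(self, d_model=512, d_ff=2048, dropout=0.1):
        super().__init__()
        # The same two-layer MLP is applied to every position independently,
        # so it only mixes information along the feature dimension.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),  # the nonlinearity that adds expressive power
            nn.Linear(d_ff, d_model),
        )
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        # Residual connection: the FFN learns a correction to x, then Add&Norm.
        return self.norm(x + self.dropout(self.ffn(x)))


# Usage: the block keeps the (batch, seq_len, d_model) shape unchanged.
x = torch.randn(2, 10, 512)
print(PositionWiseFFN()(x).shape)  # torch.Size([2, 10, 512])
```

Because the block is shape-preserving, you can in principle interleave or omit it per layer, which is what the attn->attn->...->[attn-ffn] ordering above would amount to.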