In equation 9.4.8, the dimensions of O_t are given as n x q. Is that a typo? Shouldn’t it be 1 x q (same as b_q)?
There are q outputs and the batch is of size n, which is why the output is n x q.
The bias is of size 1 x q, but it is broadcast to n x q during the addition.
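
To see the broadcasting concretely, here is a minimal NumPy sketch. It assumes the equation has the form O_t = H_t W_hq + b_q; the names `H_t` and `W_hq`, the hidden size `h`, and the specific sizes and values are illustrative assumptions, not taken from the book's code.

```python
import numpy as np

# Illustrative sizes (assumed): batch size n, hidden units h, outputs q
n, h, q = 4, 3, 2

H_t = np.ones((n, h))                          # hidden state, shape (n, h)
W_hq = np.ones((h, q))                         # output weights, shape (h, q)
b_q = np.arange(q, dtype=float).reshape(1, q)  # bias, shape (1, q)

# (n, q) + (1, q): the single bias row is broadcast across all n examples
O_t = H_t @ W_hq + b_q

print(H_t.shape, W_hq.shape, b_q.shape, O_t.shape)
```

Every row of `O_t` receives the same bias row, so the result keeps one row per example in the batch, while `b_q` itself stays 1 x q.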