In equation 9.4.8, the dimension of O_t is given as n x q. Is that a typo? Shouldn’t it be 1 x q (same as b_q)?
There are q outputs and the batch is of size n, which is why the output is n x q. The bias is of size 1 x q, but it is broadcast to n x q during addition.
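
To make the shapes concrete, here is a small NumPy sketch. It assumes the output layer has the usual form O = H W + b; the names `H`, `W`, `b` and the sizes `n`, `h`, `q` are illustrative choices, not taken from the book's code.

```python
import numpy as np

# Illustrative sizes (assumptions): batch size, hidden units, outputs
n, h, q = 4, 3, 2

rng = np.random.default_rng(0)
H = rng.normal(size=(n, h))   # hidden state for the minibatch: n x h
W = rng.normal(size=(h, q))   # output weights: h x q
b = rng.normal(size=(1, q))   # bias: 1 x q

# H @ W is n x q; the 1 x q bias is broadcast across all n rows
O = H @ W + b
print(O.shape)  # (4, 2)

# Broadcasting gives the same result as copying the bias n times by hand
b_tiled = np.repeat(b, n, axis=0)  # n x q
print(np.allclose(O, H @ W + b_tiled))  # True
```

So each of the n examples in the batch gets its own row of q outputs, and every row has the same bias added to it.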