In equation 9.4.8, the dimensions of O_t are given as n x q. Is that a typo? Shouldn’t it be 1 x q (same as b_q)?

There are `q` outputs and the batch is of size `n`, which is why the output is `n * q`. The bias is of size `1 * q`, but it’s broadcast to `n * q` during addition.
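
If it helps, here is a small NumPy sketch of the shapes involved. It is only an illustration, not code from the book: I'm assuming the output is computed as `H_t @ W_hq + b_q`, and the sizes `n = 4`, `h = 5`, `q = 3` as well as the names `H_t` and `W_hq` are made up for the example.

```python
import numpy as np

# Hypothetical sizes: batch size n, hidden units h, outputs q.
n, h, q = 4, 5, 3

rng = np.random.default_rng(0)
H_t = rng.normal(size=(n, h))   # hidden state for one minibatch (assumed name)
W_hq = rng.normal(size=(h, q))  # hidden-to-output weights (assumed name)
b_q = rng.normal(size=(1, q))   # bias: a single row of q values

# (n, h) @ (h, q) gives (n, q); adding the (1, q) bias broadcasts it over all n rows.
O_t = H_t @ W_hq + b_q
print(O_t.shape)  # (4, 3)

# Broadcasting gives the same result as explicitly repeating the bias row n times.
explicit = H_t @ W_hq + np.repeat(b_q, n, axis=0)
print(np.allclose(O_t, explicit))  # True
```

So every example in the batch gets the same bias row added to it, and the result keeps one row per example.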