Gated Recurrent Units (GRU)

Perhaps I’m missing something, but it looks like there’s a dimensionality disagreement:
both products X_tW_x and H_{t-1}W_h have shape n×h, yet the biases have shape 1×h.
Is broadcasting implicitly applied to the bias terms to enable the summation?

Yes. We also mentioned it in linear regression:

Nonetheless, I’ve just added such an explanation:
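To illustrate the broadcasting in question, here is a small sketch with toy shapes (the variable names are illustrative, not from the book): the 1×h bias row is stretched across all n rows of the n×h sum.

```python
import numpy as np

n, d, h = 4, 3, 5  # batch size, input dim, hidden dim (toy values)

X = np.random.randn(n, d)     # minibatch input, shape (n, d)
W_x = np.random.randn(d, h)   # input-to-hidden weights, shape (d, h)
H = np.random.randn(n, h)     # previous hidden state, shape (n, h)
W_h = np.random.randn(h, h)   # hidden-to-hidden weights, shape (h, h)
b = np.random.randn(1, h)     # bias, shape (1, h)

# The (1, h) bias row is broadcast across all n rows of the (n, h) sum.
out = X @ W_x + H @ W_h + b
print(out.shape)  # (4, 5)
```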

For optimizing the hyperparameters in question 2, do we need to perform k-fold cross-validation (and thus augment train_ch8), or just try out different hyperparameters straight in the train_ch8 function itself?

I think, in theory at least, it would be correct to optimize our hyperparameters via a hold-out set, right?
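A minimal sketch of the hold-out idea: train each candidate configuration on the training split and keep whichever scores best on a held-out validation split. The function `train_and_eval` and the candidate grids here are hypothetical stand-ins, not from the book.

```python
import itertools

def train_and_eval(lr, num_hiddens):
    # Hypothetical stand-in for calling train_ch8 on a train split and
    # returning perplexity on a held-out validation split.
    # We fake a score here so the sketch runs end to end.
    return abs(lr - 1.0) + abs(num_hiddens - 256) / 256

lrs = [0.5, 1.0, 2.0]
hiddens = [128, 256, 512]

# Try every combination; keep the one with the best hold-out score.
best = min(itertools.product(lrs, hiddens),
           key=lambda cfg: train_and_eval(*cfg))
print(best)  # (1.0, 256)
```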

Possible typo at

Furthermore, we will expand the RNN architecture with a single undirectional hidden layer that has been discussed so far.

Should it be “unidirectional”?

Yup, indeed a typo. Thanks!



A good article on Convex Combinations (mentioned in Section [Hidden State])
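The connection to the hidden-state update: since each entry of the update gate Z lies in (0, 1), the update H_t = Z ⊙ H_{t-1} + (1 − Z) ⊙ H̃_t is an elementwise convex combination, so each entry of H_t lies between the corresponding entries of the old state and the candidate. A quick numerical check (toy tensors, not the book's code):

```python
import numpy as np

rng = np.random.default_rng(0)
Z = 1 / (1 + np.exp(-rng.standard_normal((2, 4))))  # sigmoid -> entries in (0, 1)
H_prev = rng.standard_normal((2, 4))                # previous hidden state
H_tilde = rng.standard_normal((2, 4))               # candidate hidden state

# Elementwise convex combination: each entry of H lies between the
# corresponding entries of H_prev and H_tilde.
H = Z * H_prev + (1 - Z) * H_tilde

lo = np.minimum(H_prev, H_tilde)
hi = np.maximum(H_prev, H_tilde)
print(np.all((lo <= H) & (H <= hi)))  # True
```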

The influence of hyperparameters on running time and perplexity, tracked with Weights & Biases:


This looks lovely. Is this wandb?

I think there is a small problem with the GRU visualization: the (1 −) output should be multiplied (×) with h_{t-1}, not with the output of the tanh gate.
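For reference, a small sketch of one GRU step following the section's equations, which makes explicit which branch each factor multiplies: Z gates the old state and (1 − Z) gates the tanh candidate. The parameter layout and helper names here are illustrative, not the book's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def gru_step(X, H_prev, params):
    """One GRU step following the section's equations (toy sketch)."""
    W_xz, W_hz, b_z, W_xr, W_hr, b_r, W_xh, W_hh, b_h = params
    Z = sigmoid(X @ W_xz + H_prev @ W_hz + b_z)               # update gate
    R = sigmoid(X @ W_xr + H_prev @ W_hr + b_r)               # reset gate
    H_tilde = np.tanh(X @ W_xh + (R * H_prev) @ W_hh + b_h)   # candidate
    # Z multiplies the old state; (1 - Z) multiplies the tanh candidate.
    return Z * H_prev + (1 - Z) * H_tilde

n, d, h = 2, 3, 4
rng = np.random.default_rng(1)
params = [rng.standard_normal(s) * 0.1
          for s in [(d, h), (h, h), (1, h)] * 3]
H = gru_step(rng.standard_normal((n, d)), np.zeros((n, h)), params)
print(H.shape)  # (2, 4)
```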