Gated Recurrent Units (GRU) - pytorch

Nov '20

Omri_Berman

Perhaps I’m missing something, but it looks like there’s a dimensionality mismatch:
Both products X_t W_x and H_{t-1} W_h have shape n×h, yet the biases have shape 1×h.
Is the bias implicitly broadcast across the n rows to make the summation work?

Nov '20 ▶ Omri_Berman

astonzhang

Yes. We also mentioned it in linear regression:

Nonetheless, I’ve just added such explanations: https://github.com/d2l-ai/d2l-en/commit/99b92a706b543cfee03b5f9cd874d4771c97cd37
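
To make the broadcasting concrete, here is a minimal plain-Python sketch of what adding a 1×h bias to an n×h matrix computes: the bias row is added to every row. The helper name `add_bias` and the numbers are illustrative assumptions, not code from the book; PyTorch performs the equivalent row-wise addition implicitly when the shapes are (n, h) and (1, h).

```python
# Plain-Python sketch of what broadcasting a (1, h) bias over an (n, h) matrix computes.
# `add_bias` and all values below are hypothetical, chosen only for illustration.

def add_bias(matrix, bias_row):
    """Add a length-h bias row to every row of an n x h matrix."""
    return [[x + b for x, b in zip(row, bias_row)] for row in matrix]

# Stands in for X_t W_x + H_{t-1} W_h, shape (n, h) with n = 3, h = 2.
pre_activation = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
# Stands in for the bias b, shape (1, h).
bias = [10.0, 20.0]

out = add_bias(pre_activation, bias)
print(out)  # every row has had the same bias row added
```

The output keeps the n×h shape, which is why the summation in the gate equations is well defined.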

Feb '21

six

For optimizing the hyperparameters in question 2, do we need to perform k-fold validation (and thus augment train_ch8), or just try out different hyperparameters straight in the train_ch8 function itself?

I think that, in theory at least, it would be correct to optimize our hyperparameters using a hold-out set, right?
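
A hold-out search along those lines could be sketched as follows. This is only an illustration under stated assumptions: `holdout_split`, `select_hyperparams`, and `fake_train_and_score` are hypothetical helpers, not part of the book's code, and the scoring function is a toy stand-in for "train the model, then return validation perplexity".

```python
import itertools

def holdout_split(tokens, val_fraction=0.2):
    """Split a token sequence into contiguous train and validation parts."""
    cut = int(len(tokens) * (1 - val_fraction))
    return tokens[:cut], tokens[cut:]

def select_hyperparams(tokens, grid, train_and_score):
    """Try every combination in `grid`; keep the one with the lowest validation score."""
    train, val = holdout_split(tokens)
    best_cfg, best_score = None, float("inf")
    for values in itertools.product(*grid.values()):
        cfg = dict(zip(grid.keys(), values))
        score = train_and_score(train, val, cfg)  # e.g. validation perplexity
        if score < best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# Toy stand-in for real training: pretends lr=1.0 and num_hiddens=256 are ideal.
def fake_train_and_score(train, val, cfg):
    return abs(cfg["lr"] - 1.0) + abs(cfg["num_hiddens"] - 256) / 256

tokens = list(range(100))  # placeholder for a tokenized corpus
grid = {"lr": [0.5, 1.0, 2.0], "num_hiddens": [128, 256]}
best_cfg, best_score = select_hyperparams(tokens, grid, fake_train_and_score)
print(best_cfg, best_score)
```

The split is contiguous rather than shuffled because the data is a sequence; in a real run, `train_and_score` would wrap the actual training loop and evaluate on the held-out part.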

Apr '21

AbL

Possible typo at 10. Modern Recurrent Neural Networks — Dive into Deep Learning 1.0.3 documentation

Furthermore, we will expand the RNN architecture with a single undirectional hidden layer that has been discussed so far.

Should it be “unidirectional”?

Apr '21

astonzhang

Yup indeed a typo. Thanks!

Sep '21

Tang

In Section 9.1 of the Chinese edition, this should be LSTM here, right?

Feb '22

imflash217

A good article on Convex Combinations (mentioned in Section 9.1.1.3 [Hidden State])
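
For readers who want a quick numeric feel for the convex combination in the hidden-state update, H_t = Z_t ⊙ H_{t-1} + (1 − Z_t) ⊙ H̃_t, here is a small elementwise sketch. The function name `gru_hidden_update` and the values are assumptions made for illustration.

```python
def gru_hidden_update(z, h_prev, h_tilde):
    """Elementwise convex combination: z * h_prev + (1 - z) * h_tilde."""
    return [zi * hp + (1 - zi) * ht for zi, hp, ht in zip(z, h_prev, h_tilde)]

z = [0.0, 0.5, 1.0]          # update gate values, each in [0, 1]
h_prev = [2.0, 2.0, 2.0]     # previous hidden state
h_tilde = [4.0, 4.0, 4.0]    # candidate hidden state

h_new = gru_hidden_update(z, h_prev, h_tilde)
print(h_new)
```

With z = 0 the candidate state is taken in full, with z = 1 the old state is kept, and values in between interpolate, so each entry of the new state always lies between the old and candidate entries.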

May '22

gopalakrishna-r

The influence of hyperparameters on running time and perplexity, as run on “Weights & Biases”

Oct '23

JH.Lam

Since GRUs are able to mitigate exploding gradients, why does the code here still use the old training block with gradient clipping and detach()? How can we demonstrate the benefit of the GRU?
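
For reference, the clipping step being asked about can be sketched as rescaling the gradient whenever its norm exceeds a threshold θ. This is a plain-Python illustration with a hypothetical helper name, not the book's implementation. Gating mainly targets vanishing gradients and does not by itself bound the gradient norm, which may be one reason the clipping is kept.

```python
import math

def clip_by_global_norm(grads, theta):
    """Rescale `grads` so that its L2 norm is at most `theta`."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > theta:
        scale = theta / norm
        return [g * scale for g in grads]
    return list(grads)  # already within the threshold: left unchanged

clipped = clip_by_global_norm([3.0, 4.0], theta=1.0)    # norm 5 is scaled down to 1
unchanged = clip_by_global_norm([0.3, 0.4], theta=1.0)  # norm 0.5 is left alone
print(clipped, unchanged)
```

The direction of the gradient is preserved; only its length is capped.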

Dec '24 ▶ Xer12306

Riezmann75

Yes, it depends on your computing resources, the amount of training data, and the size of your model.
