Gated Recurrent Units (GRU)

astonzhang · September 17, 2020, 4:40am

https://d2l.ai/chapter_recurrent-modern/gru.html

Omri_Berman · November 27, 2020, 5:19pm

Perhaps I’m missing something, but it looks like there’s a dimensionality disagreement:
Both products X_t W_x and H_{t-1} W_h have shape n×h, yet the biases have shape 1×h.
Is the bias implicitly broadcast across the rows to make the summation work?

astonzhang · November 27, 2020, 11:47pm

Yes. We also mentioned it in linear regression:

Nonetheless, I’ve just added such explanations: https://github.com/d2l-ai/d2l-en/commit/99b92a706b543cfee03b5f9cd874d4771c97cd37
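
To make the broadcasting concrete, here is a minimal NumPy sketch. The sizes `n`, `d`, `h` and the random values are illustrative assumptions, not taken from the book's code; the point is only that a 1×h bias row is added to every row of an n×h matrix.

```python
import numpy as np

n, d, h = 4, 3, 5  # batch size, input features, hidden units (illustrative values)
rng = np.random.default_rng(0)

X_t = rng.normal(size=(n, d))     # minibatch input
H_prev = rng.normal(size=(n, h))  # previous hidden state
W_x = rng.normal(size=(d, h))
W_h = rng.normal(size=(h, h))
b = rng.normal(size=(1, h))       # one bias row shared by all n examples

# (n, h) + (n, h) + (1, h): the bias row is broadcast along the batch axis.
pre_activation = X_t @ W_x + H_prev @ W_h + b

# Equivalent explicit version: repeat the bias row n times before adding.
explicit = X_t @ W_x + H_prev @ W_h + np.tile(b, (n, 1))
```

Both versions give the same n×h result; broadcasting just avoids materializing the repeated bias.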

six · February 27, 2021, 1:22am

For optimizing the hyperparameters in question 2, do we need to perform k-fold validation (and thus augment train_ch8), or just try different hyperparameters straight in the train_ch8 function itself?

I think that, in theory at least, it would be correct to optimize our hyperparameters using a hold-out set, right?
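
A rough sketch of what hold-out selection could look like, independent of train_ch8. The helper names `holdout_split`, `select_hyperparams`, and the callable `fit_and_score` are hypothetical; `fit_and_score` stands in for whatever trains a model and returns its validation perplexity.

```python
from itertools import product

def holdout_split(tokens, val_fraction=0.1):
    """Split a token sequence into contiguous train and validation parts."""
    cut = int(len(tokens) * (1 - val_fraction))
    return tokens[:cut], tokens[cut:]

def select_hyperparams(tokens, grid, fit_and_score):
    """Return the configuration with the lowest validation score.

    `fit_and_score(train, val, **config)` is a hypothetical callable that
    trains a model on `train` and returns its validation perplexity.
    """
    train, val = holdout_split(tokens)
    best_config, best_score = None, float("inf")
    keys = sorted(grid)
    # Try every combination of values in the grid.
    for values in product(*(grid[k] for k in keys)):
        config = dict(zip(keys, values))
        score = fit_and_score(train, val, **config)
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score
```

The split is contiguous rather than shuffled because the data is a sequence. A single hold-out split is cheaper than k-fold, at the cost of a noisier estimate.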

AbL · April 11, 2021, 2:15pm

Possible typo at 10. Modern Recurrent Neural Networks — Dive into Deep Learning 1.0.3 documentation

Furthermore, we will expand the RNN architecture with a single undirectional hidden layer that has been discussed so far.

Should it be “unidirectional”?

astonzhang · April 21, 2021, 5:43am

Yup indeed a typo. Thanks!

Tang · September 3, 2021, 6:27am

In Section 9.1 of the Chinese version, this should be LSTM here, right?

imflash217 · February 6, 2022, 2:41am

A good article on Convex Combinations (mentioned in Section 9.1.1.3 [Hidden State])
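
As a small illustration of the convex-combination view: the section writes the hidden-state update as H_t = Z_t ⊙ H_{t-1} + (1 − Z_t) ⊙ H̃_t, where the two weights are nonnegative and sum to one. The NumPy sketch below uses made-up shapes and random values (assumptions for illustration only) to show that each entry of the new state lies between the old state and the candidate.

```python
import numpy as np

rng = np.random.default_rng(0)
H_prev = rng.normal(size=(2, 4))                 # previous hidden state
H_tilde = np.tanh(rng.normal(size=(2, 4)))       # candidate hidden state
Z = 1 / (1 + np.exp(-rng.normal(size=(2, 4))))   # update gate (sigmoid), in (0, 1)

# Elementwise convex combination of old state and candidate.
H = Z * H_prev + (1 - Z) * H_tilde

# Each entry of H is bounded by the corresponding entries of H_prev and H_tilde.
lower = np.minimum(H_prev, H_tilde)
upper = np.maximum(H_prev, H_tilde)
```

When Z is close to 1 the old state is mostly kept, and when Z is close to 0 the candidate mostly replaces it.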

gopalakrishna-r · May 8, 2022, 11:51pm

The influence of hyperparameters on running time and perplexity, as tracked with Weights & Biases

sahilrajpal121 · September 14, 2022, 2:19pm

This looks lovely, is this wandb?

Reza_Rawassizadeh · November 29, 2022, 1:06am

I think there is a small problem with the GRU visualization: the “1-” should be multiplied (×) with h_{t-1}, not with the output of the tanh gate.

pandalabme · September 5, 2023, 9:20am

My solutions to the exercises: 10.2

Xer12306 · September 20, 2023, 12:41pm

When I run the code that trains the model, it is very slow (about 2 hours).
Is that normal?

JH.Lam · October 18, 2023, 9:15am

Since the GRU is supposed to mitigate exploding gradients, why does the code here still use the old block with gradient clipping and detach()? How can we see the benefit of the GRU?
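
For context, a minimal sketch of what clipping by norm does, in plain Python. This is not the book's implementation; `clip_gradients` and `theta` are names chosen here for illustration. Gating changes how information and gradients flow across time steps, but nothing in it bounds the gradient norm, which may be why a safeguard like this is still kept.

```python
import math

def clip_gradients(grads, theta):
    """Rescale a list of gradient values so their L2 norm is at most theta."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > theta:
        # Shrink every component by the same factor; the direction is preserved.
        return [g * theta / norm for g in grads]
    return list(grads)
```

Gradients whose norm is already below `theta` pass through unchanged, so clipping only matters on the occasional large step.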

Riezmann75 · December 1, 2024, 10:49am

Yes, it depends on your computing resources, the amount of training data, and the size of your model.