Correct me if I’m wrong: in (8.7.11), it looks like the tanh activation was omitted. Shouldn’t it be as below?
- z[t] = W_hh . h[t-1] + W_hx . x[t] + b
- h[t] = tanh(z[t])
Thus in (8.7.16), each term in the sum is also missing a tanh-derivative factor, namely the product of (1 - tanh^2(z[i])) for i from t - j + 1 to t, i.e.
- (1 - h[t]^2)(1 - h[t - 1]^2)(1 - h[t - 2]^2) … (1 - h[t - j + 1]^2)
For reference, the DeepMind x UCL lecture on RNNs presents a similar formula that includes the tanh component.
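The recurrence and the chain of tanh-derivative factors described above can be sketched numerically. This is a minimal NumPy illustration, not the book's code; the sizes, the helper name `dh_dh`, and the random weights are made up for the example. Each backward step through time contributes one factor diag(1 - h[i]^2) @ W_hh to the Jacobian dh[t]/dh[t-j]:

```python
import numpy as np

rng = np.random.default_rng(0)
H, X, T = 3, 2, 5                       # hidden size, input size, sequence length (arbitrary)
W_hh = rng.normal(scale=0.5, size=(H, H))
W_hx = rng.normal(scale=0.5, size=(H, X))
b = np.zeros(H)
xs = rng.normal(size=(T, X))

# forward pass: h[t] = tanh(W_hh . h[t-1] + W_hx . x[t] + b)
h = np.zeros(H)
hs = []
for x in xs:
    h = np.tanh(W_hh @ h + W_hx @ x + b)
    hs.append(h)

# Jacobian dh[t]/dh[t-j]: each unrolled step multiplies in
# diag(1 - h[i]^2) @ W_hh for i = t, t-1, ..., t-j+1,
# which is exactly the product of (1 - h[i]^2) factors above.
def dh_dh(t, j, hs, W_hh):
    J = np.eye(len(hs[0]))
    for i in range(t, t - j, -1):
        J = J @ np.diag(1.0 - hs[i] ** 2) @ W_hh
    return J

J = dh_dh(4, 3, hs, W_hh)
print(J.shape)  # (3, 3)
```

Since |1 - tanh^2| <= 1, these factors are one source of the vanishing-gradient behavior discussed in the section.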
Hi @witx, great catch on the tanh function. We will fix it!
As for the missing bias term, we omit it for the sake of simplicity (similarly to https://d2l.ai/chapter_multilayer-perceptrons/backprop.html). You can think of W = [b, w1, w2, …] and X = [1, x1, x2, …].
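The bias-absorption trick mentioned here is easy to verify numerically. A quick NumPy sketch (the variable names are just for illustration): prepending a constant 1 to the input and the bias to the weight vector reproduces the same affine output.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=4)      # original weights
b = 0.7                     # original bias
x = rng.normal(size=4)      # original input

# absorb the bias: W = [b, w1, w2, ...], X = [1, x1, x2, ...]
W = np.concatenate(([b], w))
X = np.concatenate(([1.0], x))

# both forms compute the same affine map
assert np.isclose(w @ x + b, W @ X)
```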
Please correct me if I am wrong: in (8.7.16), if j runs from 1 to (t - 1), shouldn’t the index be h_(j-1) instead of h_j, given the preceding equation (8.7.11)? Would really appreciate the help!
Sure I would love to. Submitted a PR. Thank you!
We just heavily revised this section:
Specifically, to keep things simple we assume the activation function in the hidden layer uses the identity mapping.
Can I ask why (8.7.2) needs the 1/t factor? Is it there to average the gradient across time steps, or does it serve another purpose?
Yes, it just averages the gradient so that the scale of the loss isn’t sensitive to the length of the sequence.
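The effect of the 1/t factor can be seen with a toy computation (a hypothetical sketch; the per-step losses here are just random numbers standing in for real per-timestep losses): the summed loss grows with the sequence length, while the averaged loss stays at roughly the same scale.

```python
import numpy as np

rng = np.random.default_rng(0)

def summed_and_mean_loss(T):
    """Simulate T per-timestep losses; return their sum and their 1/T average."""
    losses = rng.normal(loc=1.0, scale=0.1, size=T) ** 2
    return losses.sum(), losses.mean()

for T in (10, 1000):
    total, mean = summed_and_mean_loss(T)
    print(f"T={T}: sum={total:.2f}, mean={mean:.2f}")
```

The sum (and hence its gradient) scales with T, so without 1/t a single learning rate would behave very differently on short and long sequences.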
Correct me if I am wrong, but I think I found a minor mistake in the last term of equation (8.7.3). In particular, I believe the top equation in this screenshot should be replaced with the bottom one:
It’s a typo. Thanks for fixing it!