Backpropagation Through Time

musashi14 · January 16, 2022, 2:20am

I think there should be a 1/2 term for both parts of the sum. That 1/2 term goes away when you keep going to the recurrence in 8.7.7.

yoya989 · April 5, 2022, 3:11pm

Hi,
The “+” sign is correct after applying the chain rule for one independent variable.
Suppose that

x = g(t) is a differentiable function of t
z = f(x, t) is a differentiable function of x and t

Then z = f(g(t), t) is a differentiable function of t, and its derivative is:

d(z)/d(t) = partial(z)/partial(x) . d(x)/d(t) + partial(z)/partial(t)

Back to the recurrent computation: as stated in the text, h_t depends on h_{t-1} and w_h, while h_{t-1} also depends on w_h. Thus, if we treat:

h_t as z
h_{t-1} as x
w_h as t

we will have the formula 8.7.4.
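
A quick numerical check may make the "+" easier to accept. The sketch below assumes a scalar recurrence h_t = tanh(w * h_{t-1} + x_t), which is a toy choice and not taken from the book; the function names and input values are made up for illustration. It compares the recurrence (direct term plus the term through h_{t-1}) against a finite-difference derivative, and also shows that the direct term alone does not match.

```python
import math

def step(x, h_prev, w):
    # Toy scalar recurrence (an assumption for illustration): h_t = tanh(w * h_{t-1} + x_t)
    return math.tanh(w * h_prev + x)

def forward(xs, w, h0=0.0):
    # Run the recurrence over all inputs and return the final hidden state.
    h = h0
    for x in xs:
        h = step(x, h, w)
    return h

def derivative_wrt_w(xs, w, h0=0.0):
    # Returns (total derivative of the final h w.r.t. w, direct term at the last step).
    h, dh_dw, direct = h0, 0.0, 0.0
    for x in xs:
        h_new = step(x, h, w)
        local = 1.0 - h_new ** 2          # derivative of tanh at this step
        direct = local * h                # partial f / partial w, with h_{t-1} held fixed
        through_h = local * w * dh_dw     # partial f / partial h_{t-1} times d h_{t-1} / d w
        h, dh_dw = h_new, direct + through_h
    return dh_dw, direct

xs = [0.5, -0.3, 0.8, 0.1]
w = 0.7
eps = 1e-6

# Central finite difference of the final hidden state with respect to w.
numeric = (forward(xs, w + eps) - forward(xs, w - eps)) / (2 * eps)
total, direct_only = derivative_wrt_w(xs, w)

print("finite difference:", numeric)
print("direct + through h_{t-1}:", total)
print("direct term only:", direct_only)
```

With these inputs the first two printed values should agree closely, while the direct term alone should come out noticeably different, since it ignores how h_{t-1} itself moves when w changes.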

LW_Tuck · September 5, 2022, 5:52pm

I am also confused about this. Didn’t the first section discuss how dh_t / dW_hx needed to take into account the use of W_hx in the hidden states leading up to h_t?

leeauk21 · December 27, 2022, 1:04pm

Hi I think BPTT from 9.7.14 - 9.7.15 is wrong

ssjam331 · February 19, 2023, 6:39pm

I don’t know why the notation choice of this section is so bad.
Since h_t = f(x_t, h_(t-1), w_h), we can cancel the left term of Equation 8.7.4 with the first term of the right terms, making the last term of the right terms equal to zero.
All this nonsense comes from the poor notation.
Although the authors state that the mathematical notation here does not explicitly distinguish between scalars, vectors, and matrices, this actually makes it harder to understand.

Xingqiu_He · April 25, 2023, 1:16pm

The author confused the total derivative with the partial derivative. The LHS of the equation should use the total derivative symbol.
For chain rule with multi-variable functions, check (14.5.1) in the following link: 14.5: The Chain Rule for Multivariable Functions - Mathematics LibreTexts.
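
For reference, one way to write the recurrence with that distinction made explicit, assuming h_t = f(x_t, h_{t-1}, w_h) as in the section (this is a sketch of the intended form, not the book's own typesetting):

```latex
\frac{d h_t}{d w_h}
  = \frac{\partial f(x_t, h_{t-1}, w_h)}{\partial w_h}
  + \frac{\partial f(x_t, h_{t-1}, w_h)}{\partial h_{t-1}} \, \frac{d h_{t-1}}{d w_h}
```

Here the partial derivatives treat the arguments of f as independent, while the total derivatives on the left and in the last factor account for the dependence of the hidden states on w_h.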

pandalabme · September 5, 2023, 2:57am

My solutions to the exercises: 9.7

JH.Lam · October 13, 2023, 4:03pm

Great question. I noticed this issue too.
I think the first derivation in section 8.7.1 is redundant with 8.7.2, and the two are not consistent, so I mainly focus on the latter.

JH.Lam · October 13, 2023, 4:47pm

Although it's apparently wrong, I don't think it can be fixed by simply replacing '+' with '='. Since W_h acts on both X_t and H_{t-1}, the formula would then lose one partial derivative w.r.t. X_t. Instead, I prefer the derivation in section 8.7.2.

Tianqi_Zhu · January 24, 2024, 4:38am

This formula is correct. w_h has a dependency on x_t because we concatenate x_t with h_{t-1}, so this part tries to calculate the derivative based on the current x_t.
Remember, this is just an intuition, as suggested at the beginning of this chapter.

no_name · June 28, 2024, 9:27pm

Sorry, I’m confused about something.

Why is it that the derivative

depends on h_t and h_(t+1) only? If we perturb h_t, we change h_(t+1), but also h_(t+2), and so on. How can we stop at h_(t+1)? What am I missing here?

Thanks for the help.

LyricsGo · July 15, 2024, 12:48pm

It depends on o_t and h_(t+1). Take h_2 for example. You need to understand the chain.
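
To make the chain concrete, here is a minimal sketch assuming a scalar linear RNN, h_t = w_hx * x_t + w_hh * h_(t-1) and o_t = w_qh * h_t, with a squared loss averaged over T steps. The loss choice, the numbers, and the function names are assumptions for illustration. The point is that the gradient w.r.t. h_(t+1) already contains the effect on h_(t+2) and everything after it, so the recursion only needs to look one step ahead.

```python
def hidden_states(xs, w_hx, w_hh, h0=0.0):
    # Forward pass: h_t = w_hx * x_t + w_hh * h_{t-1}
    hs, h = [], h0
    for x in xs:
        h = w_hx * x + w_hh * h
        hs.append(h)
    return hs

def loss_from(t, h_t, xs, ys, w_hx, w_hh, w_qh):
    # Loss from steps t..T-1 when the hidden state at step t is forced to h_t.
    # Perturbing h_t here changes every later hidden state, not just the next one.
    T = len(xs)
    total, h = 0.0, h_t
    for s in range(t, T):
        if s > t:
            h = w_hx * xs[s] + w_hh * h
        o = w_qh * h
        total += 0.5 * (o - ys[s]) ** 2 / T
    return total

def backward(hs, ys, w_hh, w_qh):
    # One-step recursion: dL/dh_t = w_qh * dL/do_t + w_hh * dL/dh_{t+1}
    T = len(hs)
    grads = [0.0] * T
    grad_next = 0.0  # nothing comes after the last step
    for t in reversed(range(T)):
        dL_do = (w_qh * hs[t] - ys[t]) / T
        grads[t] = w_qh * dL_do + w_hh * grad_next
        grad_next = grads[t]
    return grads

xs = [1.0, -0.5, 0.3, 0.8]
ys = [0.2, 0.1, -0.4, 0.5]
w_hx, w_hh, w_qh = 0.6, 0.9, 1.3
eps = 1e-5

hs = hidden_states(xs, w_hx, w_hh)
recursive = backward(hs, ys, w_hh, w_qh)

# Finite differences that propagate the perturbation of h_t through ALL later steps.
numeric = [
    (loss_from(t, hs[t] + eps, xs, ys, w_hx, w_hh, w_qh)
     - loss_from(t, hs[t] - eps, xs, ys, w_hx, w_hh, w_qh)) / (2 * eps)
    for t in range(len(xs))
]

for t, (r, n) in enumerate(zip(recursive, numeric)):
    print(t, r, n)
```

The two columns should agree at every step, even though the finite-difference version lets the perturbation reach h_(t+2) and beyond while the recursion only references the gradient at t+1.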

psl-schaefer · December 31, 2024, 10:47am

It has been noted by others, but equation 9.7.4 cannot be correct. I think this does not help at all to build intuition if there are such errors.