Backpropagation Through Time

I think there should be a 1/2 term for both parts of the sum. That 1/2 term goes away once you continue on to the recurrence in 8.7.7.

Hi,
The “+” sign is correct after applying the chain rule for one independent variable.
Suppose that

  • x = g(t) is a differentiable function of t
  • z = f(x, t) is a differentiable function of x and t

Then z = f(g(t), t) is a differentiable function of t, and its derivative is:

  • d(z)/d(t) = partial(z)/partial(x) · d(x)/d(t) + partial(z)/partial(t)
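To make the rule above concrete, here is a small numerical check. The functions g and f are hypothetical choices picked only for illustration (they are not from the book); the sketch compares the chain-rule sum, including the direct partial of f with respect to t, against a finite-difference estimate.

```python
import math

def g(t):
    # x = g(t); hypothetical choice for illustration
    return math.sin(t)

def f(x, t):
    # z = f(x, t); hypothetical choice for illustration
    return x * x * t + math.exp(t)

def total_derivative(t):
    x = g(t)
    dz_dx = 2.0 * x * t              # partial of f w.r.t. x, with t held fixed
    dx_dt = math.cos(t)              # derivative of g
    dz_dt_direct = x * x + math.exp(t)  # partial of f w.r.t. t, with x held fixed
    return dz_dx * dx_dt + dz_dt_direct

def numeric_derivative(t, eps=1e-6):
    # central finite difference of the composite z(t) = f(g(t), t)
    return (f(g(t + eps), t + eps) - f(g(t - eps), t - eps)) / (2.0 * eps)

t0 = 0.7
print(total_derivative(t0), numeric_derivative(t0))
```

The two printed values should agree to several decimal places; dropping either term of the sum breaks the agreement.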

Back to the recurrent computation: as stated in the text, h_t depends on h_{t-1} and w_h, while h_{t-1} also depends on w_h. Thus, if we treat:

  • h_t as z
  • h_{t-1} as x
  • w_h as t

we get formula 8.7.4.
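The mapping above can be sanity-checked on a toy model. This is only a sketch under my own assumptions: a scalar recurrence h_t = tanh(w_h * h_{t-1} + w_x * x_t) with made-up inputs, not the book's code. It accumulates the derivative with the "+" recurrence and compares it with a finite-difference estimate.

```python
import math

def forward(w_h, xs, w_x=0.5, h0=0.0):
    # Scalar toy recurrence (an assumption): h_t = tanh(w_h * h_{t-1} + w_x * x_t)
    hs = [h0]
    for x in xs:
        hs.append(math.tanh(w_h * hs[-1] + w_x * x))
    return hs

def dh_dw_recurrence(w_h, xs, w_x=0.5, h0=0.0):
    hs = forward(w_h, xs, w_x, h0)
    grad = 0.0  # dh_0/dw_h = 0 because h_0 is a constant
    for t in range(1, len(hs)):
        slope = 1.0 - hs[t] ** 2          # tanh'(a) = 1 - tanh(a)^2
        direct = slope * hs[t - 1]        # partial f / partial w_h, h_{t-1} held fixed
        through_h = slope * w_h           # partial f / partial h_{t-1}
        grad = direct + through_h * grad  # direct term plus the path through h_{t-1}
    return grad

def dh_dw_numeric(w_h, xs, eps=1e-6):
    # central finite difference of the final hidden state w.r.t. w_h
    hi = forward(w_h + eps, xs)[-1]
    lo = forward(w_h - eps, xs)[-1]
    return (hi - lo) / (2.0 * eps)

xs = [0.3, -1.2, 0.8, 0.1]  # made-up input sequence
w_h = 0.9
print(dh_dw_recurrence(w_h, xs), dh_dw_numeric(w_h, xs))
```

The agreement between the two numbers is what the "+" buys you: the direct term alone ignores every earlier use of w_h.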

I am also confused about this. Didn’t the first section discuss how dh_t / dW_hx needed to take into account the use of W_hx in the hidden states leading up to h_t?

Hi, I think BPTT in 9.7.14 - 9.7.15 is wrong.

I don’t know why the notation in this section is so bad.
Since h_t = f(x_t, h_(t-1), w_h), we can cancel the left-hand side of Equation 8.7.4 against the first term on the right-hand side, which makes the last term on the right-hand side equal to zero.
All this nonsense comes from the poor notation.
Although the authors state that the mathematical notation here does not explicitly distinguish between scalars, vectors, and matrices, that choice makes the section hard to understand.

The author confused the total derivative with the partial derivative. The LHS of the equation should use the total derivative symbol.
For the chain rule with multivariable functions, see (14.5.1) in the following link: 14.5: The Chain Rule for Multivariable Functions - Mathematics LibreTexts.
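For what it's worth, this is how I would write the recurrence with the two kinds of derivative kept apart. It is my own rewriting, assuming h_t = f(x_t, h_{t-1}, w_h) as in the thread; the upright d marks a total derivative (all dependence on w_h counted) and the curly ∂ marks a partial of f with its other arguments held fixed.

```latex
\frac{\mathrm{d} h_t}{\mathrm{d} w_h}
  = \frac{\partial f(x_t, h_{t-1}, w_h)}{\partial w_h}
  + \frac{\partial f(x_t, h_{t-1}, w_h)}{\partial h_{t-1}}
    \, \frac{\mathrm{d} h_{t-1}}{\mathrm{d} w_h}
```

Written this way, the left-hand side and the first term on the right are visibly different objects, so they cannot be cancelled against each other.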

My solutions to the exercises: 9.7

Great question. I found this issue too.
I think the first derivation in section 8.7.1 is redundant with 8.7.2, and the two are not consistent, so I mainly focus on the latter.

Although it’s apparently wrong, I don’t think it can be fixed by simply replacing ‘+’ with ‘=’. Since Wh acts on both Xt and H[t-1], this formula would lose one partial derivative w.r.t. Xt. Instead, I prefer the derivation in section 8.7.2.

This formula is correct. wh has a dependency on xt because we concatenate xt with ht-1, so this part computes the derivative based on the current xt.
Remember, this is just an intuition, as suggested at the beginning of this chapter.
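One way to read the concatenation remark: applying a single weight matrix to the stacked vector [xt; ht-1] gives the same result as applying separate input and hidden weights and adding. The sketch below uses made-up 2x2 weights and plain Python lists; the names W_hx, W_hh and W_cat are my own labels for illustration.

```python
def matvec(W, v):
    # plain-Python matrix-vector product
    return [sum(w * x for w, x in zip(row, v)) for row in W]

# Hypothetical values, chosen only for illustration
W_hx = [[0.2, -0.1], [0.4, 0.3]]   # weights applied to x_t
W_hh = [[0.5, 0.0], [-0.3, 0.7]]   # weights applied to h_{t-1}
x_t = [1.0, 2.0]
h_prev = [0.5, -1.0]

# Separate products, then add
separate = [a + b for a, b in zip(matvec(W_hx, x_t), matvec(W_hh, h_prev))]

# One matrix [W_hx, W_hh] applied to the concatenated vector [x_t; h_{t-1}]
W_cat = [rx + rh for rx, rh in zip(W_hx, W_hh)]
stacked = matvec(W_cat, x_t + h_prev)

print(separate, stacked)
```

So a single w_h in the simplified notation can stand for both blocks, and its direct partial derivative then involves the current xt as well as ht-1.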