It’s a typo. Thanks for fixing it!
Can you explain the difference between formula 8.7.4 and 8.7.16? Why in 8.7.16 do we not need to compute
∂h_t/∂W_hx recurrently, like what you did in 8.7.4?
I think you misunderstood that. It is noted that:
So, the derivative is related to w_h and h_{t-1}.
I had a question on Truncating Time Steps
Alternatively, we can truncate the sum in [(8.7.7)](http://d2l.ai/chapter_recurrent-neural-networks/bptt.html#equation-eq-bptt-partial-ht-wh-gen) after 𝜏 steps. This is what we have been discussing so far, such as when we detached the gradients in [Section 8.5](http://d2l.ai/chapter_recurrent-neural-networks/rnn-scratch.html#sec-rnn-scratch). This leads to an *approximation* of the true gradient,
So if I analyze the code in 8.5, are we detaching it at each time step?
```python
#@save
def train_epoch_ch8(net, train_iter, loss, updater, device, use_random_iter):
    """Train a net within one epoch (defined in Chapter 8)."""
    state, timer = None, d2l.Timer()
    metric = d2l.Accumulator(2)  # Sum of training loss, no. of tokens
    for X, Y in train_iter:
        if state is None or use_random_iter:
            # Initialize `state` when either it is the first iteration or
            # using random sampling
            state = net.begin_state(batch_size=X.shape[0], device=device)
        else:
            if isinstance(net, nn.Module) and not isinstance(state, tuple):
                # `state` is a tensor for `nn.GRU`
                state.detach_()
            else:
                # `state` is a tuple of tensors for `nn.LSTM` and
                # for our custom scratch implementation
                for s in state:
                    s.detach_()
        y = Y.T.reshape(-1)
        X, y = X.to(device), y.to(device)
        y_hat, state = net(X, state)
        l = loss(y_hat, y.long()).mean()
        if isinstance(updater, torch.optim.Optimizer):
            updater.zero_grad()
            l.backward()
            grad_clipping(net, 1)
            updater.step()
        else:
            l.backward()
            grad_clipping(net, 1)
            # Since the `mean` function has been invoked
            updater(batch_size=1)
        metric.add(l * y.numel(), y.numel())
    return math.exp(metric[0] / metric[1]), metric[1] / timer.stop()
```
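If I read that loop correctly, the detach happens once per minibatch (right before the forward pass over that minibatch), not at every time step: within a minibatch, gradients still flow across all of its time steps, and only the state carried over from the previous minibatch is cut off. Here is a minimal sketch of what detaching does to the graph, using a made-up one-weight "recurrence" rather than the book's code:

```python
import torch

# Toy 3-step "recurrence" that reuses the same scalar weight w at every step.
w = torch.tensor(2.0, requires_grad=True)
h = torch.tensor(1.0)   # initial state, no gradient history

h = w * h               # step 1
h = w * h               # step 2
h = h.detach()          # truncation point: gradients stop flowing past here
h = w * h               # step 3

h.backward()
# Only step 3 contributes to w.grad, so it equals the detached value 4.0;
# without the detach it would be d(w**3)/dw = 3 * w**2 = 12.0.
print(w.grad)           # tensor(4.)
```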
Equation 8.7.16 seems to be wrong: dh(t)/dW_hh != h(t-1), but rather equals what's expressed in equation 8.7.7. The same goes for dh(t)/dW_hx.
Please correct me if there is some hidden assumption, thanks.
I think it should be an “=” too.
So do I. The first term is the same as the second term.
Perhaps he is referring to dL/dw and not ∂L/∂w.
I think there should be a 1/2 term for both parts of the sum. That 1/2 term goes away when you keep unrolling the recurrence in 8.7.7.
Hi,
The “+” sign is correct after applying the chain rule for one independent variable.
Suppose that
- x = g(t) is a differentiable function of t
- z = f(x, t) is a differentiable function of x and t

Then z = f(g(t), t) is a differentiable function of t, and its derivative is:
- dz/dt = ∂z/∂x · dx/dt + ∂z/∂t

Back to the recurrent computation: as stated in the text, h_t depends on h_{t-1} and w_h, while h_{t-1} also depends on w_h. Thus, if we see:
- h_t as z
- h_{t-1} as x
- w_h as t
we will have the formula 8.7.4.
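To make that substitution concrete, here is the same rule written out with z = h_t, x = h_{t-1}, and t = w_h (my own transcription; it should match the shape of 8.7.4, where the book keeps ∂ on the left-hand side as well):

$$
\frac{\partial h_t}{\partial w_h}
= \frac{\partial f(x_t, h_{t-1}, w_h)}{\partial w_h}
+ \frac{\partial f(x_t, h_{t-1}, w_h)}{\partial h_{t-1}}
\frac{\partial h_{t-1}}{\partial w_h}.
$$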
I am also confused about this. Didn’t the first section discuss how dh_t / dW_hx needed to take into account the use of W_hx in the hidden states leading up to h_t?
Hi, I think BPTT from 9.7.14 - 9.7.15 is wrong.
I don’t know why the notation choice of this section is so bad.
Since h_t = f(x_t, h_{t-1}, w_h), we can cancel the left-hand side of Equation 8.7.4 with the first term on the right-hand side, making the last term on the right-hand side equal to zero.
All of this nonsense comes from the flawed notation.
Although the authors state that the mathematical notation here does not explicitly distinguish between scalars, vectors, and matrices, this actually makes it harder to understand.
The author confused the total derivative with the partial derivative. The LHS of the equation should use the total derivative symbol.
For the chain rule with multivariable functions, check (14.5.1) at the following link: 14.5: The Chain Rule for Multivariable Functions - Mathematics LibreTexts.
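For reference, the two-intermediate-variable case stated there looks roughly like this (paraphrasing from memory, so please check the link for the exact wording):

$$
\frac{dz}{dt}=\frac{\partial z}{\partial x}\frac{dx}{dt}+\frac{\partial z}{\partial y}\frac{dy}{dt},
\qquad z=f(x,y),\; x=x(t),\; y=y(t).
$$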
Great question. I found this issue too.
I think the derivatives in section 8.7.1 are redundant with those in 8.7.2, and the two are not consistent, so I mainly focus on the latter.
Although it's apparently wrong, I don't think it can be fixed by simply replacing '+' with '='. Since W_h acts on both x_t and h_{t-1}, that replacement would make the formula lose the partial derivative w.r.t. x_t. Instead, I prefer the derivatives in section 8.7.2.
This formula is correct: w_h has a dependency on x_t, because we concatenate x_t with h_{t-1}, so this part tries to calculate the derivative based on the current x_t.
Remember, this is just an intuition, as suggested at the beginning of this chapter.