I am trying to implement my own RNN/GRU, but if I don't reset the gradients (with .zero_grad()), the network does not converge.
Am I correct that the implementation in chapter 8.5 never resets the gradients, on the grounds that this is unnecessary because the hidden states are detached? Or are the gradients actually reset somewhere that I missed?
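For context, here is a minimal sketch of the kind of training loop I mean (my own toy example, not the book's code): it shows both operations the question is about, detaching the hidden state across truncated-BPTT chunks and resetting the gradients before each backward pass. The model, shapes, and hyperparameters are just illustrative.

```python
import torch
import torch.nn as nn

# Toy setup: a GRU predicting the next value of a sine wave.
torch.manual_seed(0)
model = nn.GRU(input_size=1, hidden_size=8, batch_first=True)
head = nn.Linear(8, 1)
params = list(model.parameters()) + list(head.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)
loss_fn = nn.MSELoss()

# One long sequence processed in truncated-BPTT chunks.
seq = torch.sin(torch.linspace(0, 12, 401)).reshape(1, -1, 1)
state = None
for t in range(0, 400, 40):
    x = seq[:, t : t + 40]       # inputs for this chunk
    y = seq[:, t + 1 : t + 41]   # next-step targets

    out, state = model(x, state)
    state = state.detach()       # the detach in question: cut the graph between chunks
    loss = loss_fn(head(out), y)

    optimizer.zero_grad()        # the reset in question: clear accumulated gradients
    loss.backward()
    optimizer.step()
```

In this sketch, dropping the optimizer.zero_grad() line makes the gradients accumulate across chunks even though the states are detached, which is the behavior I am seeing in my own implementation.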