So, where is the code that detaches the gradient?
@terrytangyuan, in TF do we need to use https://www.tensorflow.org/guide/advanced_autodiff#stop_gradient ?
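For context, here is a minimal sketch of what `tf.stop_gradient` does: it blocks gradient flow through a tensor inside a `tf.GradientTape`, which is the TF analogue of detaching. The variable names here are just for illustration.

```python
import tensorflow as tf

x = tf.Variable(3.0)
with tf.GradientTape() as tape:
    y = x * x
    # Gradient does not flow through y: it is treated as a constant.
    z = tf.stop_gradient(y) + x

# d z / d x = 1.0 here; without stop_gradient it would be 2*x + 1 = 7.0
grad = tape.gradient(z, x)
print(float(grad))
```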
Why does the TensorFlow version's perplexity stay so high and bumpy, even with lr=0.0001 and the Adam optimizer? Is something wrong?
I have fixed the bug: just transpose Y accordingly (because we have transposed X).
Then the training result is normal (perplexity = 1.0)!
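A minimal sketch of the fix described above, using NumPy shapes for illustration (the actual batch contents are hypothetical): since X is transposed to time-major order `(num_steps, batch_size)` before being fed step by step, the labels Y must be transposed the same way before flattening for the loss, so each prediction lines up with its own label.

```python
import numpy as np

batch_size, num_steps = 2, 3
# Toy next-token data: X and Y share the (batch, time) layout.
X = np.arange(batch_size * num_steps).reshape(batch_size, num_steps)
Y = X + 1  # labels are the "next token" of each input position

X_t = X.T                 # (num_steps, batch_size), fed one step at a time
Y_t = Y.T                 # transpose the labels accordingly -- the fix
y_flat = Y_t.reshape(-1)  # order now matches the concatenated step outputs
print(y_flat)             # labels interleaved across the batch, per time step
```

Without the `Y.T`, flattening yields labels grouped per sequence instead of per time step, so the loss compares each output with the wrong target, which would explain the high, noisy perplexity.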
Great. PR please: http://preview.d2l.ai/d2l-en/master/chapter_appendix-tools-for-deep-learning/contributing.html
Reading this chapter's source code was a nightmare. Why did you make things so complicated?