Jun '20

StevenJokes

I noticed that 2.5.2 does’t have PyTorch, so I tried to convert it.

I used “.detach()” and “.clone()” to simulate “.copy()” in MXNET.

But I’m confused with differences from “.detach()” and “.clone()”.

Which one is more like to “.copy()” in MXNET?


from torch.autograd import Variable

x = torch.arange(4.0, requires_grad=True)

# make it as a Variable with a gradient taken.

x =  torch.autograd.Variable(x, requires_grad=True)

y = x * x

y.backward(torch.ones(y.size()))

AttributeError: ‘Tensor’ object has no attribute ‘copy’


# detach

u = x.detach()

# replace :u = torch.autograd.Variable(u, requires_grad=True)

# make tensor autograd works

u.requires_grad()

v = u * u

v.backward(torch.ones(v.size()))

x.grad == u.grad

tensor([True, True, True, True])


# clone

u = x.clone()

u.requires_grad()

v = u * u

v.backward(torch.ones(v.size()))

x.grad == u.grad

tensor([True, True, True, True])

For more:

Links that I have searched:

github.com

StevenJokes/D2L_enread/blob/master/Chapter2/2-5-2.md



<!--
 * @version:
 * @Author: steven
 * @Date: 2020-06-11 19:06:46
 * @LastEditors: steven
 * @LastEditTime: 2020-06-11 19:23:57
 * @Description:
-->
I noticed that 2.5.2 does't have PyTorch, so I tried to convert it.

I used ".detach()" and ".clone()" to simulate  ".copy()" in MXNET.

But I'm confused with differences from ".detach()" and ".clone()".

Which one is more like to ".copy()" in MXNET?

```python
from torch.autograd import Variable

This file has been truncated. show original

Jun '20

2.5.4.

“Consequently d / a allows us to verify that the gradient is correct.”
I didn’t understand the meaning of d / a.
So I code as follow. But now, I’m more confused.

b = torch.randn(size=(1,), requires_grad=True)
b = b + 1000
# record the calculation
d = f(b)
d.backward()
b.grad == (d / b)

False

d tensor([1999.6416], grad_fn=)
b tensor([999.8208], grad_fn=)

t = b.grad
t

no return?

2.5.5.

The MXnet code of prediction mode haven’t mentioned.
I was wondering about the PyTorch’s code.

Jun '20

StevenJokes

I’m not very clear. Maybe second derivative is more easy to overflow the scope of storages.


y.backward()

RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.


y = 2 * torch.dot(x, x)

y

y.backward(retain_graph=True)

y.backward(retain_graph=True)

no Error!

Error happened!


# matrix

a = torch.randn(20, requires_grad=True).reshape(5, 4)

d = f(a)

d.backward()

RuntimeError: grad can be implicitly created only for scalar outputs

analyze:

source code:~.conda\envs\pytorch\lib\site-packages\torch\autograd_init_.py in _make_grads(outputs, grads)

rows:from 32 to 35 :


        elif grad is None: # why grad is None? Does f(a) not support matrix?

            if out.requires_grad:

                if out.numel() != 1:

                    raise RuntimeError("grad can be implicitly created only for scalar outputs")

TODO:
I’m trying to import plot in 2.4, so I convert the “2-4.ipynb” to “two-four.py”.

But when I import it, it just happens as follow.
How to fix it?


from ..d2l import torch as d2l

from IPython import display

import numpy as np

from .two_four import plot

ImportError Traceback (most recent call last)

in

----> 1 from …d2l import torch as d2l

  2 from IPython import display

  3 import numpy as np

  4 from .two_four import plot

ImportError: attempted relative import with no known parent package

TODO:

1 reply

Jun '20 ▶ StevenJokes

anirudh

Can you explain what exactly is the doubt here?
To import -> from d2l import torch as d2l you first need to install the d2l package in your environment.

Jun '20

StevenJokes

The first reply is my 2.5.2’s pytorch code. I use “detach” and “clone” to simulate what “copy” do in MXNET. But which one is best, “detach” or “clone”?

The second reply is about "the meaning of d / a " in 2.5.4. I tried another “b”.The return of “b.grad == (d / b)” is false rather than true. why?

The third reply is my answer to 2.5.7. I’m still working on how to import plot in 2.4.

1 reply

Jun '20

Answer to first question.
tensor.detach() creates a tensor that shares the same storage with tensor that does not require grad.
But tensor.clone() will also give you original tensor’s requires_grad attributes. It is basically an exact copy including the computation graph.

Use detach() to remove a tensor from computation graph and use clone to copy the tensor while still keeping the copy as a part of the computation graph it came from.

The second answer about "the meaning of d / a " in 2.5.4.
“a.grad == (d / a)” is true because if you see how d is calculate using f(a). It is basically scaling a by some constant factor k. And if you were to do differentiate such a function say function d=k*a with respect to a then you would get that k. Hence these are true and obviously it won’t hold for b because d is a function of which scales a and not b.

Answer to third question.
As i suggested earlier and you probably did it, you can simply pip install d2l to import torch from d2l.
If you want to import the specific rename imtorch.py just add this to the start of your code before making the import. ->

import sys
sys.path.insert(0, "../")

Let me know if these answers solve your issue.

1 reply

Jun '20 ▶ anirudh

StevenJokes

Q1.
Does it mean that if I ‘clone’ a tensor or a Variable that has requires_grad attribute, then I don’t need to .requires_grad() for the new one?

Use detach() to remove a tensor? However, why can I still judge x.grad() == u.grad? “Remove” doesn’t mean that x will not exist? I think that x and u are just two different names for the same storage.

Can my code about 2.5.2’s Variable add to 2.5.2?

Q2.
I have understand that d is a function of which scales a. But what difference with f(b)?
Because, I think that f(b) = 2 * b.

First, b(formal parameter) =2 * a (argument b).
Then, not enter the loop. (b > 1000)
Then, True for if, return b(formal parameter) which is 2 * b(argument)

Q3.
Thanks. I will try it next time. I have did pip install d2l
The specified module could not be found
Did the problem happen because the module doesn’t exist in my sys.path?

1 reply

Jun '20 ▶ StevenJokes

anirudh

Ans1.
Yes we don’t use Variable in PyTorch now. We can use Tensor to do everything a variable did earlier with the latest version.

Ans2.
Yes, f(b) will be 2*b and if you change it to the following you will get True.

b = torch.randn(size=(1,), requires_grad=True)
d=f(b)
d.backward()

b.grad == d/b
>>>True

Ans3. Just uninstall the pip version and run python setup.py develop in your root d2l-en directory for installing the library.

2 replies

Jun '20 ▶ anirudh

anirudh

Ans3.
Just run this python setup.py develop in your repository.

Jun '20 ▶ anirudh

StevenJokes

Q2.

b = b + 1000

Why does it make False?
I found that print(b.grad) is None?
Why did it happen?

Aug '20

ccpvirus

my solution to question 5

import numpy as np
from d2l import torch as d2l
x = np.linspace(- np.pi,np.pi,100)
x = torch.tensor(x, requires_grad=True)
y = torch.sin(x)
for i in range(100):
y[i].backward(retain_graph = True)

d2l.plot(x.detach(),(y.detach(),x.grad),legend = ((‘sin(x)’,“grad w.s.t x”)))

looks dummy since I compute the grad 100 times, is there any better way?

3 replies

Aug '20

SONG_PAN

I don’t quite understand 2.5.2.

My understanding is that by default the framework will convert the actual output matrix into a vector based on a “gradient vector” that we passed in. The description of this “gradient vector” is

which specifies the gradient of the differentiated function w.r.t self.

What does it mean? Does “gradient of differentiated function” mean “second order gradient”?

1 reply

Aug '20

SONG_PAN

My answers to the questions: please point out if I am misunderstood anything

Why is the second derivative much more expensive to compute than the first derivative?

Because instead of following the original computation graph, we need to construct a new one that corresponds to the calculation of first-order gradient?

After running the function for backpropagation, immediately run it again and see what happens.

RuntimeError: Trying to backward through the graph a second time, but the saved intermediate results have already been freed. Specify retain_graph=True when calling backward the first time.

In the control flow example where we calculate the derivative of d with respect to a , what would happen if we changed the variable a to a random vector or matrix. At this point, the result of the calculation f(a) is no longer a scalar. What happens to the result? How do we analyze this?

Directly changing would give the “blah blah … only scalar output” which is what 2.5.2 is talking about. I changed the code to

a = torch.randn(size=(3,), requires_grad=True)
d = f(a)
d.sum().backward()

and a.grad == d / a still gives true. This is because in the function, there is no cross-element operation. So each element of the vector is independent and doesn’t affect other element’s differentiation.

Redesign an example of finding the gradient of the control flow. Run and analyze the result.

SKIP

Let $f(x) = \sin(x)$. Plot $f(x)$ and $\frac{df(x)}{dx}$, where the latter is computed without exploiting that $f’(x) = \cos(x)$.

import numpy as np
from d2l import torch as d2l
x = np.linspace(- np.pi,np.pi,100)
x = torch.tensor(x, requires_grad=True)
y = torch.sin(x)
y.sum().backward()

d2l.plot(x.detach(),(y.detach(),x.grad),legend = (('sin(x)',"grad w.s.t x")))

Aug '20

goldpiggy

Hi @SONG_PAN, this is a first order gradient. The gradient is only available after we differentiate the function.

1 reply

Nov '20 ▶ goldpiggy

Aaron_L

Hi, May I confirm my understanding about the auto differentiation of Python control flow?
I assume, as long as the Python control flow is established by functions and variables from Pytorch, then the auto differentiation is doable. Am I right on this? No other functions should be involved such as sin() from math package(have to use sin() from Pytorch instead if I want auto differentiation).

1 reply

Nov '20

StevenJokess

github.com

StevenJokess/d2l-en-read/blob/29574f604221e53149aa71288d56e5de5d7a34cd/discuss/Ch02/Auto Differentiation.ipynb

{
 "metadata": {
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6-final"
  },
  "orig_nbformat": 2,
  "kernelspec": {
   "name": "python3",
   "display_name": "Python 3"
  }
 },

This file has been truncated. show original

I tried. Did you mean this? @Aaron_L

Feb '21 ▶ ccpvirus

Lefan

I guess you can replace
for i in range(100):
y[i].backward(retain_graph = True)

with the following sentence:
y.sum().backward(retain_graph = True)

Mar '21

gary

Anyone know how the backpropagation of CNN and RNN works? Are there any step by step tutorials to show the derivation? Thanks!

1 reply

Mar '21 ▶ gary

StevenJokess

@gary

“Searching” is the best way to solve common questions like this.
Do you really want to learn? Or you only want to ask to act like a learner.
CNN: TODO:for you
RNN: http://preview.d2l.ai/d2l-en/master/chapter_recurrent-neural-networks/bptt.html

May '21 ▶ ccpvirus

JackFu123

How about:
x = np.linspace(- np.pi,np.pi,100)
x = torch.tensor(x, requires_grad=True)
y = torch.sin(x)
z = y.sum()
z.backward()

d2l.plot(x.detach(),(y.detach(),x.grad),legend = (('sin(x)','grad w.s.t x')))

May '21

Luis

My solution to the last exercise

import torch
import matplotlib.pyplot as plt
import numpy as np
import math
%matplotlib inline

x = torch.arange(0, 2*math.pi, 0.1)
x.requires_grad_(True)
y = torch.sin(x).sum().backward() #Derivando y respecto a x

dy_dx = x.grad

x = x.detach().numpy()

y = np.sin(x)
dy_dx = dy_dx.numpy()

fig = plt.figure()
plt.plot(x, y, ‘-’, label = ‘y = sin(x)’)
plt.plot(x, dy_dx, ‘–’, label =’ dy_dx = cos(x)’)
plt.axis(‘equal’)
leg = plt.legend();

Jul '21

CE_I

My solution to the last exercise

Aug '21 ▶ ccpvirus

Mingming_Li

x = torch.arange(-np.pi, np.pi, 0.1, requires_grad=True)
y = torch.sin(x)
y.sum().backward()
x.grad

Nov '21

pbouzon

Exercise 1. The second derivative is much more expensive because, for a function f with n variables, the Hessian Matrix has n² elements while the gradient vector has n elements. In other words, there are n² possible second order derivatives ( ∂²f/∂x1², ∂²f/∂x1∂x2, …) while there are n possible first order derivatives (∂f/∂x1, ∂f/∂x2, …, ∂f/∂xn).

Apr '22

Donald_Slowik

Backward for Non-Scalar Variables:

Seems, in the text, by calculating y.backward(torch.ones(len(x))), we contract the Jacobian over the 4 components of y(x). The Jacobian is a 4x4 matrix. I can extract each column via y.backward(torch.tensor([1,0,0,0])), #where i shift the 1 to each of the 4 column slots.
Is this best way to calculate the Jacobian?

Aug '22

GpuTooExpensive

My anwser for ex 2 3 4，welcom to tell me if you find any error

'''
ex.2
Aafter running the function for backpropagation, immediately run it again and see what hap-
pens. Why?

reference:
[https://blog.csdn.net/rothschild666/article/details/124170794](https://blog.csdn.net/rothschild666/article/details/124170794)
'''
import torch
x = torch.arange(4.0, requires_grad=True)
x.requires_grad_(True)
y = 2 * torch.dot(x, x)
y.backward(retain_graph=True)#need to set para = Ture
y.backward()
x.grad

out for ex2:
tensor([ 0., 8., 16., 24.])

'''
ex.3
In the control flow example where we calculate the derivative of d with respect to a, what
would happen if we changed the variable a to a random vector or a matrix? At this point, the
result of the calculation f(a) is no longer a scalar. What happens to the result? How do we
analyze this?


reference:
[https://www.jb51.net/article/211983.htm](https://www.jb51.net/article/211983.htm)
'''
import torch
def f(a):
    b = a * 2
    while b.norm() < 1000:
        b = b * 2
    if b.sum() > 0:
        c = b
    else:
        c = 100 * b
    return c
a = torch.randn(size=(2,3), requires_grad=True)
#a = torch.tensor([1.0,2.0], requires_grad=True)
print(a)
d = f(a)
d.backward(d.detach())#This is neccesary, I don't know why, is it because there are b and c in the function?
#d.backward()#wrong
print(a.grad)

out for ex3:
tensor([[-0.6848, 0.2447, 1.5633],
[-0.1291, 0.2607, 0.9181]], requires_grad=True)
tensor([[-179512.9062, 64147.3203, 409814.3750],
[ -33844.5430, 68346.7969, 240686.7344]])

'''
ex.4
Let f (x) = sin(x). Plot the graph of f and of its derivative f ′ . Do not exploit the fact that
f ′ (x) = cos(x) but rather use automatic differentiation to get the result.
'''
#suppose the functions in 2.4.2Visualization Utilities has been constructed in d2l
from d2l import torch as d2l
x = d2l.arange(0,10,0.1, requires_grad = True)
y = d2l.sin(x)
y.backward(gradient=d2l.ones(len(y)))#y.sum().backward() can do the same thing
#.detach.numpy() is needed because x is set to requires_grad = True
d2l.plot(x.detach().numpy(), [y.detach().numpy(),x.grad], 'x', 'f(x)', legend = ['sin(x)','sin\'(x)'])

out for ex4:

Dec '22

Denis_Kazakov

This link https://mblondel.org/teaching/autodiff-2020.pdf may help to answer the 8th question.

Jan '23

Shivam_Dixit

Answer for 4th question : Google Colab

Mar '23

R8a

Can someone please explain the result of 2.5.1 to me. I understand how the first function is just y = 2x^2, but creating the new function that is y = x.sum(), just returns 6. Is this conceptually 6x or is it just 6 or is this just x (i’m assuming the third because that’s the only way we can get 1 as the gradient all the time) or am i missing something?

1 reply

Apr '23 ▶ R8a

Roy_Zhou

My understanding about 2.5.1’s pytorch code blocks:

y = x.sum() is just a way of defining a computational graph in PyTorch, which means evaluating each component of x and adding them up. Gradients are computed on each component of x, NOT on the y graph. Evaluating gradient on a component of x means computing for y_i = x_i (which yields 1).

The same principal applies to the y = 2 * x^Tx part, y is NOT 2x^2, it’s a computational graph for evaluating 2 * x^Tx where each component of it is actually y_i = 2 * x_i * x_i. So the graph y is in fact sum { 2 * x_i * x_i }.

HTH

May '23

Ashkin

Can someone help with Questions 5 and 6. Is this a pen-paper question, writing down backprop for the computation graph, or is this supposed to be implemented (if yes, then can someone elaborate more on it or provide a solution.)

Also problems 7, 8 are difficult ones, but at least there is a good resource provided by Denis Kazakov.

Jul '23

Lina_Mo

Hi! In Exercise 5, I don’t understand why my solution and the expected derivative do not correspond:
%matplotlib inline
import numpy as np
from matplotlib_inline import backend_inline
from d2l import torch as d2l
import math

x = torch.linspace(0.1, 10, 100)

x.requires_grad_(True)

def f(x):
return ((torch.log(x ** 2)) * torch.sin(x)) + (x ** -1)
y = f(x)
y.backward(torch.ones_like(y))
p = x.grad

Plot the function and its derivative

plt.plot(x.detach().numpy(), y.detach().numpy(), label=‘f(x)’)
plt.plot(x.detach().numpy(), p.detach().numpy(), label=“f’(x)”)
plt.xlabel(‘x’)
plt.ylabel(‘y’)
plt.legend()
plt.grid(True)
plt.show()
gives me this plot:

and when I compare
desired_derivative = -(2 / x) * torch.sin(x) + torch.log(x**2) * torch.cos(x) - (1/x **2)
p == desired_derivative
it evaluates to False.
Can you help me?

1 reply

Jul '23

cclj

Ex2.

import torch
x = torch.arange(4.0)
x.requires_grad_(True)
y = x @ x

y.backward()

Output (second run):

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/tmp/ipykernel_82/1055444503.py in <module>
----> 1 y.backward()

/opt/miniconda/lib/python3.7/site-packages/torch/_tensor.py in backward(self, gradient, retain_graph, create_graph, inputs)
    394                 create_graph=create_graph,
    395                 inputs=inputs)
--> 396         torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
    397 
    398     def register_hook(self, hook):

/opt/miniconda/lib/python3.7/site-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    173     Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    174         tensors, grad_tensors_, retain_graph, create_graph, inputs,
--> 175         allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass
    176 
    177 def grad(

RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.

The .backward() call will automatically free all the intermediate tensors in a computational graph after the backprop.
In that sense, “double backward” can not be realized unless one specifies retain_graph=True.

Ex3.

def f(a):
    b = a * 2
    while b.norm() < 1000:
        b = b * 2
    if b.sum() > 0:
        c = b
    else:
        c = 100 * b
    return c

a = torch.randn(size=(4, ), requires_grad=True)
d = f(a)
d.backward()

Output:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/tmp/ipykernel_82/535593037.py in <module>
      1 a = torch.randn(size=(4, ), requires_grad=True)
      2 d = f(a)
----> 3 d.backward()

/opt/miniconda/lib/python3.7/site-packages/torch/_tensor.py in backward(self, gradient, retain_graph, create_graph, inputs)
    394                 create_graph=create_graph,
    395                 inputs=inputs)
--> 396         torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
    397 
    398     def register_hook(self, hook):

/opt/miniconda/lib/python3.7/site-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    164 
    165     grad_tensors_ = _tensor_or_tensors_to_tuple(grad_tensors, len(tensors))
--> 166     grad_tensors_ = _make_grads(tensors, grad_tensors_, is_grads_batched=False)
    167     if retain_graph is None:
    168         retain_graph = create_graph

/opt/miniconda/lib/python3.7/site-packages/torch/autograd/__init__.py in _make_grads(outputs, grads, is_grads_batched)
     65             if out.requires_grad:
     66                 if out.numel() != 1:
---> 67                     raise RuntimeError("grad can be implicitly created only for scalar outputs")
     68                 new_grads.append(torch.ones_like(out, memory_format=torch.preserve_format))
     69             else:

RuntimeError: grad can be implicitly created only for scalar outputs

Since a as well as d is no longer a scalar, we need to reduce d to make backprop feasible.

d.sum().backward()
a.grad == d / a

Output:

tensor([True, True, True, True])

Ex4-5.

The dependency graph (computational graph):

Applying the backprop algorithm according to the chain rule…
- The + operation y = n1 + n2: n1.grad = 1, n2.grad = 1
- The inverse operation n1 = 1 / x: x.grad1 = (- 1 / x ** 2) * n1.grad = - 1 / x ** 2
- The * operation n2 = n3 * n4: n3.grad = n4 * n2.grad = log(x ** 2), n4.grad = n3 * n2.grad = sin(x)
- The sin operation n3 = sin(x): x.grad2 = cos(x) * n3.grad = cos(x) * log(x ** 2)
- The log operation n4 = log(n5): n5.grad = 1 / n5 * n4.grad = 1 / x ** 2 * sin(x)
- The square operation n5 = x ** 2: x.grad3 = 2 * x * n5.grad = 2 / x * sin(x)
- Final result: x.grad = x.grad1 + x.grad2 + x.grad3 = - 1 / x ** 2 + cos(x) * log(x ** 2) + 2 / x * sin(x)

import torch

torch.manual_seed(713)
x = torch.randn(size=(), requires_grad=True)
n1, n3, n5 = 1 / x, torch.sin(x), x ** 2
n4 = torch.log(n5)
n2 = n3 * n4
y = n1 + n2
y.backward()

x.grad, - 1 / x ** 2 + torch.cos(x) * torch.log(x ** 2) + 2 / x * torch.sin(x)

Output:

(tensor(-1.7580), tensor(-1.7580, grad_fn=<AddBackward0>))

Mar '24

kyunghee_cha

Q5,6

Jul '24 ▶ Lina_Mo

Preetam_Singh

I’m probably too late for you, but posting here for posterity.
Firstly, you need to use torch.log10(x**2) in your f(x) function, as just “log” function is for natural logarithm, and we’re given “log base 10” in question. And for the same reason I guess, you’ve mistakenly calculated derivative considering natural log of x squared. I’m leaving the code below with correct derivative.

from d2l import torch as d2l

x = torch.linspace(0.1, 10, 100)

x.requires_grad_(True)

def f(x):
    return ((torch.log10(x ** 2)) * torch.sin(x)) + (x ** -1)
y = f(x)
y.backward(torch.ones_like(y))
p = x.grad

d2l.plot(x.detach(), [y.detach(), p], 'x', 'f(x)', legend=['f(x)', 'df/dx'])
z=4
desired_derivative = (2 / (x * torch.log(torch.tensor(10)))) * torch.sin(x) + torch.log10(x**2) * torch.cos(x) - (1/x **2)
print(f"{(p[z]-desired_derivative[z]).detach().numpy():.18f}")
p == desired_derivative, p[z], desired_derivative[z], p[z] == desired_derivative[z]

You might get some “False” values when comparing AD and desired derivative results, but those are just because of unreliable floating point math of computers.

Aug '24

mhermher

One more diagram for 2.5.6.5 and 2.5.6.6.

I had to get really explicit with the steps but this helped me out, maybe it’ll help someone else out too.

13 Jan

filipv

Assuming we’re talking about the derivative of a scalar with respect to a tensor, the first derivative will be of the same size as the tensor. But the second derivative will be of that size, squared! Further, autodiff only requires a single backward pass to find the gradients - but if we want to find the full Hessian, we then need to find the derivative of each component of our initial gradients. So you’d have to a full backwards pass for every parameter, or find a more efficient way of computing this.
When you run .backward() multiple times, the gradients accumulate (are added to each other).
The shape of f(a) will match the shape of a, so if a is a vector or matrix, we’ll need to call something like .sum().backward() or .backward(torch.ones_like(d)). Once we do, a.grad == d / a will yield a tensor of the same shape as a, with all True values.
See the code below:

import torch
import matplotlib.pyplot as plt
x = torch.linspace(-5, 5, 100, requires_grad=True)
y = torch.sin(x)
y.sum().backward()
plt.plot(x.detach(), y.detach())
plt.plot(x.detach(), x.grad.detach())
plt.legend(["sin(x)", "dy/dx sin(x)"])
print((x.grad == torch.cos(x)).all())

Questions 5-7

I did these on paper by hand. For anyone having a hard time with questions 5-7, I recommend working through this tutorial from Andrej Karpathy.

Question 8

I tried doing forwardprop by hand on a number of computational graphs. For each input, one has to perform a full forward pass through the graph. In contexts where we’d like to track the gradient of some output with respect to one or a few specific inputs, forwardprop makes sense! It would also make sense in any context where we have N inputs and M outputs, and N << M. In practice, neural networks have a massive number of inputs and a scalar output loss, making backwards differentiation the obvious choice. A rule of thumb is that the forward-mode cost is proportional to the number of inputs, and the backward-mode cost is proportional to the number of outputs.

Other tradeoffs may come up in practice, depending on the layout of your computational graph. For example, backprop starts off at a scalar, and the computational graph fans out from there. You might think this makes it difficult to parallelize early parts of the graph, but in practice, backprop parallelizes very well. One might think that forwardprop would be easier to parallelize (since you can start out propogating from all inputs in parallel), but you could get weird dependency bottlenecks as different branches have to be “merged” deeper into the network: if branch A from input a and branch B from input b must be multiplied together, and branch A is very fast to propogate through but branch B is very slow to propogate through, branch A will be bottlenecked by branch B. Generally, you’ll be bottlenecked by the slowest path through the network.

Another tradeoff is memory usage: backprop requires storing intermediate activations from the forward pass, meaning memory usage scales with the depth of a network. Forwardprop doesn’t have this requirement! Aside from tracking the partial derivatives at each step, forwardprop stateless as we move through the network, meaning the memory footprint is generally much smaller.

19 Feb

KinomotoMio

I have provided a diagram created using Mermaid that illustrates the detaching computation sections.
The solid lines represent the forward computation path, while the dashed lines indicate the backpropagation path.
I hope this helps you better understand the workflow of this functionality. Please let me know if there are any issues or corrections needed.
分离计算阻断传播的示意图