- the derivative of tanh:

dtanh(x)/dx = 1 − tanh^(2)(x) = 4exp(−2x)/[1+exp(−2x)]^2.

https://www.math24.net/derivatives-hyperbolic-functions/
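A quick numerical sanity check of the tanh derivative, as a minimal sketch using only the standard library. It compares 1 − tanh^(2)(x) with a central finite difference and with the closed form 4exp(−2x)/[1+exp(−2x)]^2, which follows from substituting tanh(x) = [1−exp(−2x)]/[1+exp(−2x)]. The helper names, the step size `h` and the sample points are illustrative assumptions.

```python
import math


def tanh_grad_closed_form(x):
    """Closed form of d tanh(x)/dx written in terms of exp(-2x)."""
    e = math.exp(-2.0 * x)
    return 4.0 * e / (1.0 + e) ** 2


def central_difference(f, x, h=1e-5):
    """Numerical derivative of f at x (h is an illustrative step size)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


points = [-2.0, -0.5, 0.0, 0.5, 2.0]  # arbitrary sample points

# Largest gap between the finite difference and the closed form.
max_numeric_gap = max(
    abs(central_difference(math.tanh, x) - tanh_grad_closed_form(x))
    for x in points
)

# Largest gap between 1 - tanh^2(x) and the closed form.
max_identity_gap = max(
    abs((1.0 - math.tanh(x) ** 2) - tanh_grad_closed_form(x))
    for x in points
)

print(max_numeric_gap, max_identity_gap)
```

Both gaps should come out very small, limited only by floating-point and finite-difference error.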

the derivative of the pReLU(x):

dpReLU(x)/dx = 1 (if x > 0); α (if x < 0); doesn’t exist (if x = 0, unless α = 1)

- h = ReLU(x) = max(x, 0)

y = ReLU(h) = max(h, 0) = max(x, 0) = ReLU(x)

h = pReLU(x) = max(0, x) + α min(0, x).

y = pReLU(h) = max(0, h) + α min(0, h)
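The compositions above can be checked with a few lines of plain Python. This is only a sketch: the value `alpha = 0.25`, the sample grid and the `slope` helper are assumptions chosen for illustration. It checks that ReLU(ReLU(x)) equals ReLU(x), that pReLU has slope 1 for x > 0 and α for x < 0, and that for a positive α the composition pReLU(pReLU(x)) has slope α² on the negative side, so it is still piecewise linear.

```python
def relu(x):
    return max(x, 0.0)


def prelu(x, alpha):
    return max(0.0, x) + alpha * min(0.0, x)


def slope(f, x, h=1e-6):
    """Central-difference slope of f at x (illustrative helper)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


alpha = 0.25  # illustrative value, assumed positive
xs = [i / 4.0 for i in range(-12, 13)]  # sample points in [-3, 3]

# Applying ReLU twice changes nothing.
relu_idempotent = all(relu(relu(x)) == relu(x) for x in xs)

# Slopes of pReLU away from the kink at x = 0.
right_slope = slope(lambda x: prelu(x, alpha), 1.0)
left_slope = slope(lambda x: prelu(x, alpha), -1.0)

# Slope of pReLU(pReLU(x)) on the negative side.
composed_left_slope = slope(lambda x: prelu(prelu(x, alpha), alpha), -1.0)

print(relu_idempotent, right_slope, left_slope, composed_left_slope)
```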

Adding or subtracting linear functions still gives a linear function.

- tanh(x) + 1 = [1−exp(−2x)]/[1+exp(−2x)] + 1 = 2/[1+exp(−2x)] = 2 * 1/[1+exp(−2x)] = 2 sigmoid(2x).
- Will d/2 dimensions cause linear dependence?
- overfit?
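The identity tanh(x) + 1 = 2 sigmoid(2x) from the list above can also be checked numerically. A minimal sketch, with a hand-written `sigmoid` helper and arbitrarily chosen sample points:

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


points = [-3.0, -1.0, -0.1, 0.0, 0.1, 1.0, 3.0]  # arbitrary sample points

# Largest gap between the two sides of the identity.
max_gap = max(
    abs((math.tanh(x) + 1.0) - 2.0 * sigmoid(2.0 * x))
    for x in points
)

print(max_gap)
```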

This is about the section that plots the gradient of the ReLU function:

y.backward(torch.ones_like(x), retain_graph=True)

d2l.plot(x.detach(), x.grad, 'x', 'grad of relu', figsize=(5, 2.5))

Should there be an extra line that zeroes the gradient afterwards, like this:

y.backward(torch.ones_like(x), retain_graph=True)

d2l.plot(x.detach(), x.grad, 'x', 'grad of relu', figsize=(5, 2.5))

x.grad.data.zero_()

Otherwise, if we run the notebook cell twice, the gradients will keep accumulating.

Sure @sushmit86, that’s a genuine concern and it would be a good idea, but we don’t want to add complexity to the book content.

If you feel the need to run the cells twice, you can add the extra line to zero out the grads.
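For anyone who wants to see the accumulation described above, here is a minimal sketch that needs only PyTorch (no `d2l` plotting). The input range is an assumption for illustration. It calls `backward` twice on the same graph and then once more after zeroing `x.grad`.

```python
import torch

# Illustrative input; requires_grad=True so that x.grad gets populated.
x = torch.arange(-2.0, 2.0, 0.5, requires_grad=True)
y = torch.relu(x)

# First backward pass: x.grad holds the gradient of relu.
y.backward(torch.ones_like(x), retain_graph=True)
first = x.grad.clone()

# Second backward pass without zeroing: the new gradient is added to x.grad.
y.backward(torch.ones_like(x), retain_graph=True)
accumulated = x.grad.clone()

# Zero the gradient in place, then run backward again.
x.grad.zero_()
y.backward(torch.ones_like(x), retain_graph=True)
fresh = x.grad.clone()

print(first)
print(accumulated)
print(fresh)
```

Here `accumulated` should be twice `first`, while `fresh` matches `first` again.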

Thanks so much for the reply

# Question 2:

I think it should be this:

- H = ReLU(XW^(1) + b^(1))
- y = HW^(2) + b^(2)

**More detail on page 131.**

I think it is easier to think of it like this:

ReLU(x) is a continuous piecewise linear function of x for every x ∈ R. This does not depend on what x is, provided that x itself varies continuously in R. So ReLU(ReLU(x)*W + b), for example, is also a continuous piecewise linear function.
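To make the piecewise linear claim concrete, here is a small sketch in plain Python for a one-hidden-layer ReLU MLP with a scalar input. All weights, biases and the grid are made-up illustrative values. A function that is linear on an interval has a zero second difference there, so the grid points where the second difference is non-zero mark the kinks.

```python
def relu(x):
    return max(x, 0.0)


# Made-up parameters: 3 hidden units, scalar input and scalar output.
w1 = [1.0, -2.0, 0.5]
b1 = [0.0, 1.0, -1.0]
w2 = [1.0, 0.5, -3.0]
b2 = 0.25


def mlp(x):
    hidden = [relu(w * x + b) for w, b in zip(w1, b1)]
    return sum(v * h for v, h in zip(w2, hidden)) + b2


# Uniform grid on [-4, 4] with spacing 0.01.
xs = [i / 100.0 for i in range(-400, 401)]
ys = [mlp(x) for x in xs]

# Second differences vanish wherever the function is locally linear.
tol = 1e-9
kinks = [
    xs[i]
    for i in range(1, len(xs) - 1)
    if abs(ys[i - 1] - 2.0 * ys[i] + ys[i + 1]) > tol
]

print(kinks)
```

Each hidden unit can switch on or off only where w * x + b = 0, so this toy network has at most three kinks and is linear everywhere else.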

# Question 4:

I think the biggest difference between an MLP that applies a nonlinearity and one that does not is the time and complexity. In fact, MLPs that apply nonlinearities such as sigmoid and tanh are very expensive to evaluate and to differentiate for gradient descent. So we need something faster, and ReLU is a good choice to address this problem (6.x sigmoid).

Answer for question 4:

I think if we apply a different nonlinearity to different minibatches, then as the activation function changes, the first thing to change is the range of the output, which will affect the final output.