- The derivative of tanh:

d tanh(x)/dx = 1 − tanh^2(x) = 4exp(−2x)/(1+exp(−2x))^2.

https://www.math24.net/derivatives-hyperbolic-functions/
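As a quick sanity check on the tanh derivative, here is a minimal sketch in plain Python that compares the closed form 4exp(−2x)/(1+exp(−2x))^2 with 1 − tanh^2(x) and with a central finite difference. The helper names and the step size `eps` are my own choices for illustration.

```python
import math

def dtanh_closed_form(x):
    # 4 exp(-2x) / (1 + exp(-2x))^2, written with e = exp(-2x)
    e = math.exp(-2.0 * x)
    return 4.0 * e / (1.0 + e) ** 2

def dtanh_identity(x):
    # The textbook form 1 - tanh^2(x)
    return 1.0 - math.tanh(x) ** 2

def dtanh_numeric(x, eps=1e-5):
    # Central finite difference (eps is an illustrative choice)
    return (math.tanh(x + eps) - math.tanh(x - eps)) / (2.0 * eps)

xs = [i / 10.0 for i in range(-50, 51)]
gap_identity = max(abs(dtanh_closed_form(x) - dtanh_identity(x)) for x in xs)
gap_numeric = max(abs(dtanh_closed_form(x) - dtanh_numeric(x)) for x in xs)
print(gap_identity, gap_numeric)
```

Both gaps should be tiny, which suggests the two expressions agree on this grid.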

The derivative of pReLU(x):

dpReLU(x)/dx = 1 (if x > 0); α (if x < 0); doesn’t exist (if x = 0, unless α = 1)

- h = ReLU(x) = max(x, 0)

y = ReLU(h) = max(h, 0) = max(x, 0) = ReLU(x)

h = pReLU(x) = max(0, x) + α min(0, x).

y = pReLU(h) = max(0, h) + α min(0, h)

The sum or difference of linear functions is still a linear function, so each piece stays linear.

- tanh(x) + 1 = [1−exp(−2x)]/[1+exp(−2x)] + 1 = 2/[1+exp(−2x)] = 2 * 1/[1+exp(−2x)] = 2 sigmoid(2x).
- Would d/2 dimensions cause linear dependence?
- Overfitting?
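The identity tanh(x) + 1 = 2 sigmoid(2x) from the list above can also be spot-checked numerically. This is only a sketch on a small grid of points, not a proof; `sigmoid` is defined by hand here.

```python
import math

def sigmoid(x):
    # Logistic sigmoid: 1 / (1 + exp(-x))
    return 1.0 / (1.0 + math.exp(-x))

xs = [i / 10.0 for i in range(-50, 51)]
# Largest absolute difference between the two sides of the identity
max_gap = max(abs((math.tanh(x) + 1.0) - 2.0 * sigmoid(2.0 * x)) for x in xs)
print(max_gap)
```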

In the section that plots the gradient of the ReLU function:

y.backward(torch.ones_like(x), retain_graph=True)

d2l.plot(x.detach(), x.grad, 'x', 'grad of relu', figsize=(5, 2.5))

Should there be an extra line that zeroes the gradient afterwards, like this:

y.backward(torch.ones_like(x), retain_graph=True)

d2l.plot(x.detach(), x.grad, 'x', 'grad of relu', figsize=(5, 2.5))

x.grad.data.zero_()

Otherwise, if we run the cell twice, the gradient will keep accumulating.

Sure @sushmit86, that’s a genuine concern and it would be a good idea, but we don’t want to add complexity to the book content.

If you feel the need to run the cells twice, you can add the extra line to zero out the grads.
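To illustrate the accumulation being discussed, here is a small sketch (my own example, not code from the book) that calls `backward` twice on the same graph and then resets the gradient. It uses `x.grad.zero_()`, which has the same effect here as the `x.grad.data.zero_()` line proposed above.

```python
import torch

# Inputs chosen away from 0 so the ReLU gradient is unambiguous
x = torch.tensor([-2.0, -1.0, 1.0, 2.0], requires_grad=True)
y = torch.relu(x)

y.backward(torch.ones_like(x), retain_graph=True)
first = x.grad.clone()          # gradient after one backward pass

y.backward(torch.ones_like(x), retain_graph=True)
accumulated = x.grad.clone()    # second pass is added on top of the first

x.grad.zero_()                  # reset the accumulated gradient in place
y.backward(torch.ones_like(x), retain_graph=True)
after_reset = x.grad.clone()    # matches the first pass again

print(first, accumulated, after_reset)
```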

Thanks so much for the reply

# Question 2:

I think it should be this:

- H = ReLU(XW^(1) + b^(1))
- y = HW^(2) + b^(2)

**More detail on page 131.**

I think it is easier to think of it like this:

ReLU(x) is a continuous piecewise linear function for every x ∈ R. This does not depend on what x is, provided that x varies continuously in R. So, for example, ReLU(ReLU(x)*W+b) is also a continuous piecewise linear function.
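A tiny scalar example may make this concrete. The sketch below uses hypothetical weights (w1 = 2, b1 = −1, w2 = −2, b2 = 3, picked so the kinks fall on grid points) and checks that the slope of ReLU(w2 · ReLU(w1 · x + b1) + b2) between neighbouring grid points only ever takes a few constant values.

```python
def relu(z):
    return max(z, 0.0)

def two_layer(x, w1=2.0, b1=-1.0, w2=-2.0, b2=3.0):
    # Scalar version of ReLU(ReLU(x) * W + b) with illustrative weights
    h = relu(w1 * x + b1)
    return relu(w2 * h + b2)

# Grid from -2 to 3 in steps of 0.25; the kinks at 0.5 and 1.25 lie on it
xs = [-2.0 + 0.25 * i for i in range(21)]
ys = [two_layer(x) for x in xs]

# Slope on each grid interval; a piecewise linear function has few distinct slopes
slopes = [(y1 - y0) / (x1 - x0) for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:])]
distinct_slopes = sorted(set(slopes))
print(distinct_slopes)
```

With these weights the function is flat, then falls with slope −4, then is flat again, and the values match at the two kinks, so it is continuous.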

# Question 4:

I think the biggest difference between an MLP that applies a nonlinearity and one that does not is computation time and complexity. In fact, MLPs that apply nonlinearities such as sigmoid and tanh are expensive to evaluate and to differentiate for gradient descent. So we need something faster, and ReLU is a good choice to address this problem (6.x sigmoid).

Answer for question 4

I think that if we apply a different nonlinearity to different minibatches, then as the activation function changes, the range of the output will vary, which will affect the final output.

My answers

### Exercises

1. Compute the derivative of the pReLU activation function.

- I wrote a way to describe the function, but torch autograd was not able to work with it.

`alpha = 0.1`

`y = find_max(X) + alpha * find_min(X)`
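One way to get autograd to cooperate is to build pReLU from differentiable tensor operations. The sketch below assumes `torch.clamp` as a stand-in for the `find_max` and `find_min` helpers, whose definitions are not shown; the function name `prelu` and the sample inputs are illustrative.

```python
import torch

def prelu(x, alpha=0.1):
    # max(0, x) + alpha * min(0, x), using ops that autograd can differentiate
    return torch.clamp(x, min=0) + alpha * torch.clamp(x, max=0)

# Inputs chosen away from 0, where the derivative is not defined
x = torch.tensor([-2.0, -0.5, 0.5, 2.0], requires_grad=True)
y = prelu(x)
y.backward(torch.ones_like(x))
print(x.grad)  # alpha for negative inputs, 1 for positive inputs
```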

2. Show that an MLP using only ReLU (or pReLU) constructs a continuous piecewise linear function.

- I guess we need to construct a multilayer perceptron here. I don’t know.

3. Show that tanh(x) + 1 = 2 sigmoid(2x).

- We can show this by plotting a graph.

4. Assume that we have a nonlinearity that applies to one minibatch at a time. What kinds of problems do you expect this to cause?

- Maybe this would create problems, such as each minibatch being squished (scaled) differently.