Q1.2. Compute the variance of the distribution given by softmax(o) and show that it matches the second derivative computed above.

Can someone point me in the right direction? I tried to use Var[X] = E[(X − E[X])^2] = E[X^2] − E[X]^2 to find the variance, but I ended up with a 1/q^2 term… it doesn't look like the second derivative from Q1.1.
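Not an official answer, but here is one reading that makes the two quantities line up: take the variance of the one-hot indicator of a class drawn from softmax(o), rather than of o itself. Writing $p_j = \mathrm{softmax}(\mathbf{o})_j$ and $X_j = \mathbf{1}\{y = j\}$ for a label $y$ sampled from that distribution:

$$\mathrm{Var}[X_j] = E[X_j^2] - E[X_j]^2 = p_j - p_j^2 = p_j(1 - p_j),$$

using $X_j^2 = X_j$ and $E[X_j] = p_j$. If Q1.1 gave you $\partial_{o_j}^2 l = \mathrm{softmax}(\mathbf{o})_j\,(1 - \mathrm{softmax}(\mathbf{o})_j)$, the two expressions agree term by term, with no stray $1/q^2$.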
Show:

$$\frac{\log\left(e^{a} + e^{b}\right)}{\lambda} > \max(a, b)$$

Assume:

$$a > b, \qquad \lambda > 0$$

(so $\max(a, b) \to a$, because $a > b$)

$$\frac{\log\left(e^{a} + e^{b}\right)}{\lambda} > a$$

$$\log\left(e^{a} + e^{b}\right) > \lambda a$$

(exp both sides)

$$e^{a} + e^{b} > e^{\lambda a}$$

but the LHS is not greater than the RHS in general, so this is where I get stuck.
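For what it's worth (and assuming I'm reading exercise 3.2 correctly), the statement there is $\lambda^{-1}\,\mathrm{RealSoftMax}(\lambda a, \lambda b) > \max(a, b)$, i.e. the $\lambda$ scales the arguments inside the exponentials, not just the denominator. With that version the same chain of steps does close:

$$\frac{\log\left(e^{\lambda a} + e^{\lambda b}\right)}{\lambda} > a \iff e^{\lambda a} + e^{\lambda b} > e^{\lambda a},$$

which holds because $e^{\lambda b} > 0$.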
and 3.3:

I did the calculus and the limit looked like it was going to zero (instead of max(a, b)), so I coded up the function in NumPy to check, and indeed it appears to go to 0 instead of 4 in this case (a = 2, b = 4).
```python
In [478]: real_softmax = lambda x: 1/x * np.log(x*np.exp(2) + x*np.exp(4))

In [479]: real_softmax(.1)
Out[479]: 18.24342918048927

In [480]: real_softmax(1)
Out[480]: 4.126928011042972

In [481]: real_softmax(10)
Out[481]: 0.6429513104037019

In [482]: real_softmax(1000)
Out[482]: 0.01103468329002511

In [483]: real_softmax(100000)
Out[483]: 0.000156398534760132
```
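If I'm not mistaken, the `x` in that session ends up multiplying the exponentials ($\lambda e^a + \lambda e^b$) rather than their arguments ($e^{\lambda a} + e^{\lambda b}$), which is why the values decay to 0. Below is a small sketch of the scaled version as I read exercise 3.3; the function name `scaled_softmax` and the use of `np.logaddexp` for numerical stability are my own choices, not from the book:

```python
import numpy as np

def scaled_softmax(lam, a=2.0, b=4.0):
    # (1/lam) * log(exp(lam*a) + exp(lam*b)); np.logaddexp computes
    # log(exp(x) + exp(y)) without overflowing once lam*b gets large.
    return np.logaddexp(lam * a, lam * b) / lam

for lam in [0.1, 1.0, 10.0, 1000.0]:
    print(lam, scaled_softmax(lam))
# The values decrease toward max(a, b) = 4.0 as lam grows,
# instead of heading to 0.
```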
3.1 is very simple to prove: just move a or b to the left-hand side. Whichever one we move, we end up with [exp(a) + exp(b)]/exp(a) or [exp(a) + exp(b)]/exp(b), and both ratios are greater than 1, which proves the softmax is larger. The argument is spelled out below.
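Written out, one half of that argument (the other half is symmetric) is:

$$\log\left(e^{a} + e^{b}\right) - a = \log\frac{e^{a} + e^{b}}{e^{a}} > \log 1 = 0,$$

so $\mathrm{RealSoftMax}(a, b) > a$, and likewise $> b$, hence $> \max(a, b)$.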
However, I struggle to make sense of the lines that come after.
Is the result of 3.4.9 already the derivative, or is it only re-written? And how do they get from 3.4.9 to 3.4.10?
I'm still at the beginning of my DL journey and probably need to brush up on my calculus as well. If someone could point out to me how the formula is transformed, that would be great! I've been trying for a while now to write it out, but can't seem to figure out how it should be done.
This is a great explanation of how the softmax derivative (+ backprop) works, which I could follow and understand. But I have trouble connecting the solution back to the (more general) formula in 3.4.9.
> Is the result of 3.4.9 already the derivative, or is it only re-written?
3.4.9 is only the rewritten expression of the loss function, not the derivative. It comes mostly from the facts that log(a/b) = log(a) − log(b) and that log(exp(x)) = x.
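To sketch both steps (assuming the usual setup of that section, where $\hat{y}_j = \exp(o_j)/\sum_k \exp(o_k)$ and $\mathbf{y}$ is a one-hot label): the rewrite behind 3.4.9 is pure algebra,

$$l(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_j y_j \log \hat{y}_j = \sum_j y_j \log\sum_k \exp(o_k) - \sum_j y_j o_j = \log\sum_k \exp(o_k) - \sum_j y_j o_j,$$

where the last equality uses $\sum_j y_j = 1$. Only then do we differentiate with respect to $o_j$, which gives 3.4.10:

$$\partial_{o_j} l(\mathbf{y}, \hat{\mathbf{y}}) = \frac{\exp(o_j)}{\sum_k \exp(o_k)} - y_j = \mathrm{softmax}(\mathbf{o})_j - y_j.$$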
I have a question regarding Exercise 1 of this section of the book. I won't include the details of my calculations, to keep this as simple as possible. Sorry for my amateurism, but I couldn't render equations in this box, so I decided to upload them as images. However, because I'm a "new user" I can't upload more than one image per comment, so I'm posting the rest of this comment as a single image file.
Hey @washiloo, thanks for the detailed explanation. Your calculation is almost correct! The reason we are not considering i ≠ j is: you only need to calculate the second-order gradient when o_i can be explained by some formula in o_j; otherwise it will be zero. The blog here may also help!
Thank you for your reply, @goldpiggy! However, I don't understand what you mean by "only when you can explain o_i in some formula by o_j, or it will be zero". We are not computing the derivative of o_i with respect to o_j, but the derivative of dl/do_i (which is a function of all the o_i's) with respect to o_j, and this derivative won't, in general, be equal to zero. Here, d means "partial derivative" and l is the loss function, but I couldn't render them properly.
In my earlier post, I wrote the analytical expression of these second-order derivatives, and you can see that they are zero only when the softmax function is equal to zero for either o_i or o_j, which clearly cannot happen given the definition of the softmax function.
NOTE: I missed the index 2 on the second-order derivative of the loss function in my earlier post, sorry.
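For reference, if $l$ is the softmax cross-entropy loss from 3.4.9, the standard Hessian algebra (my own derivation, not from either post) gives, with $s_i = \mathrm{softmax}(\mathbf{o})_i$:

$$\frac{\partial^2 l}{\partial o_j \, \partial o_i} = \frac{\partial}{\partial o_j}\left(s_i - y_i\right) = s_i\left(\delta_{ij} - s_j\right) = \begin{cases} s_i\,(1 - s_i) & i = j,\\ -\,s_i\,s_j & i \neq j, \end{cases}$$

and since every $s_i \in (0, 1)$, the off-diagonal entries are indeed nonzero.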