Softmax Regression

henryell · July 2, 2021, 8:27am

I got the same result as you. I also don’t understand why some discussions assume that the second-order partial derivative equals the variance of the softmax when i=j.

henryell · July 2, 2021, 10:11am

Hi @goldpiggy ,
I think in Exercise 3.3, the condition should be \lambda -> positive infinity instead of \lambda -> infinity.
Is that right?

Thanks

fanbyprinciple · July 14, 2021, 1:53am

Sorry guys, here are my answers. Looking at the discussion here, I have the feeling that most of them are wrong, but I am posting them anyway for the sake of completeness.

Exercises

We can explore the connection between exponential families and the softmax in some more depth.

1. Compute the second derivative of the cross-entropy loss l(y, ŷ) for the softmax.

* After applying the quotient rule to the cross-entropy loss, my answer comes out to be zero. Apparently this is wrong.

2. Compute the variance of the distribution given by softmax(o) and show that it matches the second derivative computed above.

* It is close to zero in my experiments too. But why should this happen? Is the second derivative essentially the same as the variance?

![variance_of_softamax|690x252](upload://a0zaIqc2iLDEmeoaTjobMHoXd5F.png)
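A quick numerical check may help with the two questions above. The sketch below assumes the loss is written as log-sum-exp of the logits minus the logit of the true class, picks arbitrary illustrative logits and an arbitrary one-hot label, and compares a finite-difference Hessian with the covariance matrix diag(p) - p pᵀ of a one-hot draw from softmax(o). The helper names are made up for this example.

```python
import numpy as np

def softmax(o):
    # Subtract the max for numerical stability; the result is unchanged.
    z = np.exp(o - np.max(o))
    return z / z.sum()

def cross_entropy(o, y):
    # l(y, y_hat) = log sum_k exp(o_k) - sum_j y_j o_j, valid when y sums to 1.
    m = np.max(o)
    return m + np.log(np.sum(np.exp(o - m))) - np.dot(y, o)

def numerical_hessian(f, o, eps=1e-4):
    # Central finite differences for the matrix of second derivatives.
    n = o.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i = np.zeros(n)
            e_i[i] = eps
            e_j = np.zeros(n)
            e_j[j] = eps
            H[i, j] = (f(o + e_i + e_j) - f(o + e_i - e_j)
                       - f(o - e_i + e_j) + f(o - e_i - e_j)) / (4 * eps ** 2)
    return H

o = np.array([1.0, -0.5, 2.0])   # arbitrary logits (illustrative)
y = np.array([0.0, 0.0, 1.0])    # arbitrary one-hot label (illustrative)
p = softmax(o)

hessian = numerical_hessian(lambda v: cross_entropy(v, y), o)
covariance = np.diag(p) - np.outer(p, p)
print(np.allclose(hessian, covariance, atol=1e-5))
```

In this sketch the Hessian is not zero in general. Its diagonal entries are p_i(1 - p_i), and each row sums to zero, so averaging the entries can give something close to zero even though the individual second derivatives are not.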

fanbyprinciple · July 14, 2021, 1:54am

Assume that we have three classes which occur with equal probability, i.e., the probability vector is (1/3, 1/3, 1/3).

1. What is the problem if we try to design a binary code for it?

* We would need at least 2 bits; 00, 01, and 10 would be used, but not 11?

2. Can you design a better code? Hint: what happens if we try to encode two independent observations? What if we encode n observations jointly?

* We can do it through one-hot encoding, where the size of the array would be the number of observations n.
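One way to read the hint is to count bits. The sketch below assumes a fixed-length binary code word for every block of n observations and reports the bits spent per observation; `bits_per_symbol` is a name made up for this example. The numbers can be compared with the entropy log2(3) of the uniform three-class distribution.

```python
import math

def bits_per_symbol(n, num_classes=3):
    # Smallest number of bits b with 2**b >= num_classes**n, divided by n.
    num_messages = num_classes ** n
    bits = (num_messages - 1).bit_length()
    return bits / n

for n in (1, 2, 3, 5, 10, 100):
    print(n, bits_per_symbol(n))
print("entropy:", math.log2(3))
```

With a single observation, 2 bits are spent and one of the four code words is wasted. Encoding several observations jointly wastes a smaller fraction of code words, so the cost per observation moves toward log2(3), although never below it.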

Softmax is a misnomer for the mapping introduced above (but everyone in deep learning uses it). The real softmax is defined as RealSoftMax(a, b) = log(exp(a) + exp(b)).

1. Prove that RealSoftMax(a, b) > max(a, b).

* Verified, though not proved.

real_softmax
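For what it is worth, a short argument for the inequality can be sketched as follows. It uses only that the exponential is strictly positive and that the logarithm is strictly increasing:

$$\mathrm{RealSoftMax}(a, b) = \log\left(e^{a} + e^{b}\right) > \log\left(e^{\max(a, b)}\right) = \max(a, b)$$

The strict inequality holds because $e^{a} + e^{b}$ equals $e^{\max(a, b)}$ plus the strictly positive term $e^{\min(a, b)}$.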

fanbyprinciple · July 14, 2021, 1:54am

Prove that this holds for λ^{-1} RealSoftMax(λa, λb), provided that λ > 0.

* Verified, not proved.
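A numerical check of the scaled version can be sketched as below. The values of a and b are arbitrary, `scaled_real_softmax` is a name made up for this example, and `np.logaddexp` is used so that large λ does not overflow. The gap to max(a, b) should stay positive and be at most log(2)/λ, since λ^{-1} log(exp(λa) + exp(λb)) = max(a, b) + λ^{-1} log(1 + exp(-λ|a - b|)).

```python
import numpy as np

def scaled_real_softmax(a, b, lam):
    # (1 / lam) * log(exp(lam * a) + exp(lam * b)), computed stably.
    return np.logaddexp(lam * a, lam * b) / lam

a, b = 0.3, -1.2   # arbitrary illustrative values
for lam in (0.1, 1.0, 10.0):
    value = scaled_real_softmax(a, b, lam)
    gap = value - max(a, b)
    # The gap is bounded above by log(2) / lam and shrinks as lam grows.
    print(lam, value, gap, np.log(2) / lam)
```

For very large λ the gap can round to exactly zero in floating point, even though the inequality is strict mathematically.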

Yang_Zhao · July 17, 2021, 9:05pm

There is a typo in the sentence: “Then we can choose the class with the largest output value as our prediction…” It should be y hat in the argmax rather than y.

hydrogen · October 13, 2021, 12:36pm

I’ve got the same result for the second derivative of the loss, but I don’t know how to compute the variance or how it relates to the second derivative.
Did you make any progress?

achanbour · February 15, 2022, 1:02pm

Hello,

I have the following error which I cannot seem to resolve. Can anyone help me out? Thanks

Fox · March 4, 2022, 6:17am

3.1 I don’t know if it’s right…

guanyipengai · May 7, 2022, 1:31am

Think about the relationship between the Hessian matrix and the covariance matrix.
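That relationship can be written out as a short sketch. The notation is an assumption for this note: $p_i = \mathrm{softmax}(o)_i$, the label $y$ sums to 1, $\delta_{ij}$ is the Kronecker delta, and $z$ is a one-hot random vector with $P(z = e_i) = p_i$. Starting from $l = \log\sum_k e^{o_k} - \sum_j y_j o_j$:

$$\frac{\partial l}{\partial o_i} = p_i - y_i, \qquad \frac{\partial^2 l}{\partial o_i \partial o_j} = \frac{\partial p_i}{\partial o_j} = p_i(\delta_{ij} - p_j)$$

$$\mathrm{Cov}(z_i, z_j) = E[z_i z_j] - E[z_i]E[z_j] = \delta_{ij} p_i - p_i p_j = p_i(\delta_{ij} - p_j)$$

So the Hessian with respect to the logits matches the covariance matrix of $z$ entry by entry. For $i = j$ this reduces to $p_i(1 - p_i)$, the variance of a Bernoulli variable with parameter $p_i$.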

Tou_You · June 6, 2022, 11:35pm

I think it is because the y in equation 3.4.10 is a one-hot vector, so the sum of the y_j is equal to the y_j of the correct label.
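Assuming the equation in question is the expansion of the cross-entropy loss in terms of the logits (the equation number is not reproduced in this thread), the step where the sum over $y_j$ disappears can be sketched as:

$$l(y, \hat{y}) = -\sum_j y_j \log\frac{e^{o_j}}{\sum_k e^{o_k}} = \sum_j y_j \log\sum_k e^{o_k} - \sum_j y_j o_j = \log\sum_k e^{o_k} - \sum_j y_j o_j$$

The last equality only uses $\sum_j y_j = 1$, which holds for a one-hot vector and also for any probability vector.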

John_MacNeil · July 28, 2022, 7:07pm

I am unsure what this question is trying to get at:

Assume that we have three classes which occur with equal probability, i.e., the probability vector is (1/3,1/3,1/3).

What is the problem if we try to design a binary code for it?

Can you design a better code? Hint: what happens if we try to encode two independent observations? What if we encode n observations jointly?

My guess is simply that binary encoding doesn’t work when you have more than 2 categories, and one-hot encoding with possibilities (1,0,0), (0,1,0) and (0,0,1) works better.

However, this doesn’t seem to address the specific probability or the hint given, so I think my guess is too simplistic.

John_MacNeil · July 28, 2022, 8:47pm

Any further reading suggestions for question 7?

GpuTooExpensive · September 10, 2022, 3:00pm

Here are my answers to the exercises.
I’m still not sure about ex.1, ex.5.E, ex.5.F, and ex.6.

ex.1

But what does “match” mean in question B?

ex.2
A. If I use a binary code for the three classes, like 00, 01, 11, then the distance between 00 and 01 is smaller than that between 00 and 11, which contradicts the fact that the three classes have equal probability.

B. I think I should use the one-hot coding mentioned in this chapter, because for any independent observation (which I think is a class), it contains no distance information between any pair of them.

ex.3
Two ternary digits have 9 different representations, so my answer is 2.
Ternary is suitable for electronics because a physical wire has three distinct conditions: positive voltage, negative voltage, and zero voltage.
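The counting argument can be checked with a few lines. The sketch assumes the exercise asks about 8 distinct values (integers 0 to 7), which is not quoted in this post, and `digits_needed` is a name made up for this example.

```python
def digits_needed(num_values, base):
    # Smallest d with base**d >= num_values, using integer arithmetic only.
    d, capacity = 0, 1
    while capacity < num_values:
        capacity *= base
        d += 1
    return d

print(digits_needed(8, 3))  # ternary units for 8 values, prints 2
print(digits_needed(8, 2))  # binary units for 8 values, prints 3
```

Under that assumption, two ternary units cover 9 combinations, so one combination is unused, while binary needs three units for the same range.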

ex.4
A. Bradley-Terry model is like

When there are only two classes, softmax fits this exactly.
B. No matter how many classes there are, if I give a higher score to class A than to class B, the B-T model will still let me choose class A, and after 3 comparisons I will choose the class with the highest score. That still holds true for the softmax.
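Both parts can be checked numerically. The formula from the post is not shown in this thread, so the sketch assumes the usual Bradley-Terry form, P(i preferred over j) = exp(o_i) / (exp(o_i) + exp(o_j)), which equals sigmoid(o_i - o_j). The scores are arbitrary illustrative values.

```python
import numpy as np

def softmax(o):
    # Subtract the max for numerical stability; the result is unchanged.
    z = np.exp(o - np.max(o))
    return z / z.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

o = np.array([0.7, -0.4, 1.9])  # illustrative scores for three classes
p = softmax(o)

# Two classes: softmax over (o_0, o_1) matches the assumed Bradley-Terry form.
two = softmax(o[:2])
print(two[0], sigmoid(o[0] - o[1]))

# Three classes: renormalizing any pair of softmax probabilities gives the
# same pairwise preference, regardless of the third class.
for i in range(3):
    for j in range(3):
        if i != j:
            pairwise = p[i] / (p[i] + p[j])
            assert np.isclose(pairwise, sigmoid(o[i] - o[j]))
```

The normalizer of the three-class softmax cancels in each pairwise ratio, which is why the third score does not affect the comparison between the other two.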

ex.5

ex.6
This is my procedure for question A, but I can’t prove that the second derivative is just the variance.

As for the rest of the questions, I don’t even understand the question.

ex.7
A. Because of exp(−𝐸/𝑘𝑇), if I double T, alpha goes to 1/2, and if I halve it, alpha goes to 2, so T and alpha move in opposite directions.
B. If T converges to 0, the probability of any class will converge to 0, and the ratio between two classes i and j, exp(-(Ei - Ej)/kT), will also converge to 0, like a frozen object whose molecules are all static.
C. If T converges to ∞, the ratio between two classes i and j will converge to 1, which means every class has the same probability of showing up.
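The two limits can be explored numerically. The sketch below assumes probabilities proportional to exp(-E_i / kT) with made-up energies; `boltzmann` is a name chosen for this example, and kT is treated as a single parameter.

```python
import numpy as np

def boltzmann(energies, kT):
    # p_i proportional to exp(-E_i / kT); shift by the minimum energy for stability.
    e = np.asarray(energies, dtype=float)
    w = np.exp(-(e - e.min()) / kT)
    return w / w.sum()

energies = [1.0, 2.0, 4.0]  # illustrative energy levels
for kT in (0.01, 1.0, 100.0, 1e6):
    print(kT, boltzmann(energies, kT))
```

In this sketch the probabilities always sum to 1, so they cannot all go to 0 as T goes to 0. Instead the mass concentrates on the lowest-energy state, and the ratio exp(-(Ei - Ej)/kT) goes to 0 only when Ei > Ej. For large T the distribution approaches uniform, in line with answer C.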

Ruiz_Abinos · October 11, 2022, 2:18pm

Exercise 6 Show that g(x) is translation invariant, i.e., g(x+b) = g(x)

I don’t see how this can be true for b different from 0.

kartikmadhira1 · March 16, 2023, 5:47pm

A better explanation of Information theory basics can be seen:

here

cclj · July 27, 2023, 6:37am

Ex1.

Ex6 (issue: (1) translational invariant or equivariant? (Softmax is invariant, but log-sum-exp should be equivariant); (2) b or negative b? Adding maximum can make overflow problem worse).

cclj · July 27, 2023, 6:39am

The same result… I think we may say the log-partition function is translational equivariant rather than invariant. See also this page.
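The distinction can be checked numerically. The sketch below assumes g is the log-sum-exp (log-partition) function and that b is a scalar added to every coordinate; the inputs are arbitrary illustrative values.

```python
import numpy as np

def log_sum_exp(x):
    # Subtract the max before exponentiating to avoid overflow.
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def softmax(x):
    return np.exp(x - log_sum_exp(x))

x = np.array([0.5, -1.0, 3.0])  # illustrative logits
b = 2.5                         # illustrative shift

# log-sum-exp shifts by b, so this difference should be close to b.
print(log_sum_exp(x + b) - log_sum_exp(x))
# softmax is unchanged by the shift.
print(np.allclose(softmax(x + b), softmax(x)))
# Shifting by the negative maximum keeps large inputs finite.
print(log_sum_exp(np.array([1000.0, 1001.0])))
```

So g(x + b) = g(x) + b in this check, which is equivariance rather than invariance, while the softmax probabilities themselves are invariant. This is also why the stable implementation subtracts the maximum rather than adding it.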

pandalabme · August 17, 2023, 11:09am

For ex.1, maybe we can treat the softmax distribution as a Bernoulli distribution with probability $p = \text{softmax}(o)$, so the variance is:
$$Var[X] = E[X^2] - E[X]^2 = \text{softmax}(o)(1 - \text{softmax}(o))$$
I don’t know whether this assumption is right,
and my solutions to the exs: 4.1

Gabriel_de_Souza_Col · June 23, 2024, 5:19pm

Does this mean that the coordinates of each y label vector are independent of each other? Also, shouldn’t the y_j in the last 2 equations also have an “i” superscript?
I read the URL given, but it doesn’t clarify too much for this specific case.