Information Theory

https://d2l.ai/chapter_appendix-mathematics-for-deep-learning/information-theory.html

Hi. Maximum likelihood searches for the theta that maximizes the probability of the observations given the parameters, P(X | theta). This is not how it is written in your book, but it is fine in this appendix. However, when it comes to cross entropy, both the book and the appendix maximize P(y | x), which is not the maximum likelihood definition, and that is very confusing. Can anyone please explain why this is OK? Thanks

Hi, could anyone please answer this? It won't change the result, but it would improve the understanding a lot. Thanks

Hi @neoaurion, sorry for the delayed reply. Can I ask where you are quoting this sentence from? (Maximum likelihood is searching for the parameter (i.e., theta) given the current observation data (i.e., X).)

Hi @goldpiggy,
This is what I learned from my classical signal processing background, but you can also find it here: https://www.analyticsvidhya.com/blog/2018/07/introductory-guide-maximum-likelihood-estimation-case-study-r/, where f(X | theta) is the probability of the data X given the parameters theta.

Hi @neoaurion, I see your question now. Fundamentally, we still want to maximize l(\theta), i.e., find the optimal \theta, as we state in the Cross Entropy section. The only difference for cross entropy is that we have two sets of data, X and Y, which come from two different distributions, so it now looks like maximizing P(X, Y | \theta). Does that make sense?
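Spelling out the missing step (a sketch of the standard reasoning, not stated explicitly in the thread, assuming the input marginal P(X) does not depend on \theta, which is the usual discriminative-model assumption):

```latex
% Joint likelihood factors into a conditional part and an input marginal;
% only the conditional part depends on \theta.
\begin{aligned}
P(X, Y \mid \theta) &= P(Y \mid X, \theta)\, P(X \mid \theta)
                     = P(Y \mid X, \theta)\, P(X), \\
\operatorname*{arg\,max}_{\theta} P(X, Y \mid \theta)
  &= \operatorname*{arg\,max}_{\theta} P(Y \mid X, \theta).
\end{aligned}
```

Since P(X) is a constant with respect to \theta, maximizing P(y | x, \theta) is just maximum likelihood applied to the conditional distribution, which is what minimizing cross entropy does.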

Yes, it is right with P(X, Y | \theta). I think the result for cross entropy is correct, but the explanation is not.
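A quick numerical check of that equivalence (my own sketch, not from the book; the probabilities and labels below are made up for illustration): the average cross-entropy loss is the negative mean conditional log-likelihood, so minimizing one maximizes the other.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 5 examples, 3 classes.
# Rows of `probs` play the role of P(y | x_i, theta) for a fitted model.
probs = rng.dirichlet(np.ones(3), size=5)
labels = rng.integers(0, 3, size=5)  # observed classes y_i

# Conditional log-likelihood of the observed labels under the model.
log_likelihood = np.log(probs[np.arange(5), labels]).sum()

# Cross-entropy loss as usually implemented: mean negative log-probability
# assigned to the true label.
cross_entropy = -np.log(probs[np.arange(5), labels]).mean()

# They differ only by a sign and the constant factor 1/n.
assert np.isclose(cross_entropy, -log_likelihood / 5)
print(cross_entropy, -log_likelihood / 5)
```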

H(X) ≥ 0 in 18.11.2.4 is only true for discrete distributions. For example, the continuous uniform distribution U[0, 0.5] has negative differential entropy. :grinning:
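A quick check of that claim (my own sketch using SciPy, not part of the thread): the differential entropy of U[0, 0.5] is log(1/2) ≈ -0.693 nats, which is indeed negative.

```python
from scipy.stats import uniform

# Differential entropy of U[0, 0.5]; `scale` is the interval width b - a.
h = uniform(loc=0.0, scale=0.5).entropy()
print(h)  # ≈ -0.6931, i.e., log(0.5): negative, unlike discrete entropy
```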


Great catch! Corrected in this PR. :slight_smile:


Hi @goldpiggy,

Thanks for putting together this great appendix. Can you explain why eq 18.11.7 is true?