Information Theory

Hi! Thanks for such a helpful read about information theory. One thing that confuses me is the justification for using cross entropy as a loss function. The text begins with an argument through maximum likelihood estimation to justify using cross entropy, and then goes on to show that maximizing the log-likelihood is equivalent to minimizing cross entropy.

What is not clear to me is why cross entropy fits into the machine learning framework from an information-theoretic point of view. For example, cross entropy and KL divergence in information theory are about quantifying the inefficiency of using a compression scheme that is not optimal for the underlying probability distribution. How does that interpretation of cross entropy fit in here, where we are doing supervised training (a concept that feels detached from the original motivation of information theory)?


e.g., “A better language model should allow us to predict the next token more accurately. Thus, it should allow us to spend fewer bits in compressing the sequence. So we can measure it by the cross-entropy loss averaged over all the 𝑛 tokens of a sequence”

e.g., “In short, we can think of the cross-entropy classification objective in two ways: (i) as maximizing the likelihood of the observed data; and (ii) as minimizing our surprisal (and thus the number of bits) required to communicate the labels.”
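A small numeric sketch may help connect the two readings quoted above. It assumes discrete distributions given as plain Python lists and measures everything in bits (log base 2); the function names are made up for illustration and are not taken from the book. It checks that cross entropy splits into entropy plus KL divergence (the "extra bits" from coding with the wrong distribution), and that with one-hot labels the cross entropy is just the average negative log-likelihood of the observed labels.

```python
import math

def entropy_bits(p):
    # H(p): optimal average code length when the code is built for p itself
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy_bits(p, q):
    # H(p, q): average code length when data come from p but the code is built for q
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_bits(p, q):
    # D_KL(p || q): extra bits paid for using q's code instead of p's
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def avg_nll_bits(labels, preds):
    # Average negative log-likelihood of the observed labels under the model.
    # This equals the cross entropy between one-hot labels and the predictions.
    return -sum(math.log2(pred[y]) for y, pred in zip(labels, preds)) / len(labels)

# Illustrative "true" distribution p and model distribution q
p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]
print(entropy_bits(p), kl_bits(p, q), cross_entropy_bits(p, q))

# Supervised case: two examples with class labels 0 and 2
labels = [0, 2]
preds = [[0.5, 0.25, 0.25], [0.25, 0.25, 0.5]]
print(avg_nll_bits(labels, preds))
```

Since the entropy of the labels does not depend on the model, minimizing cross entropy over the model is the same as minimizing the KL term, which is one way to read supervised training as "finding the code that wastes the fewest bits on the labels".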

“The character perplexity of a language model on a test word is defined as the product of the inverse probability of each character appeared in the test word, normalized by the length of the word.”

Is there a typo in the equation for perplexity following this definition? I did not expect the normalization to appear in the exponent.

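Hi @jcatanz, great question! PPL is the inverse of the geometric mean of the probabilities of the characters that appear in the word. The “normalized” here refers to taking that geometric mean, i.e. the n-th root of the product, which is why 1/n shows up in the exponent. Let me know if that makes sense.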
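For anyone who wants to check this numerically, here is a minimal sketch, assuming we already have the probability the model assigned to each character of the test word. The function names and the example probabilities are hypothetical. It computes perplexity two ways, as the n-th root of the product of inverse probabilities and as the exponential of the average negative log-likelihood, and the two should agree up to floating-point error.

```python
import math

def perplexity_product_form(probs):
    # n-th root of the product of inverse probabilities,
    # i.e. the inverse of the geometric mean of the probabilities
    n = len(probs)
    return math.prod(1.0 / p for p in probs) ** (1.0 / n)

def perplexity_log_form(probs):
    # exp of the average negative log-likelihood (cross entropy in nats)
    n = len(probs)
    avg_nll = -sum(math.log(p) for p in probs) / n
    return math.exp(avg_nll)

# Made-up per-character probabilities for a 4-character word
probs = [0.5, 0.25, 0.5, 0.25]
print(perplexity_product_form(probs))
print(perplexity_log_form(probs))

# Sanity check: a model that is uniform over 4 characters has perplexity 4
uniform = [0.25] * 5
print(perplexity_log_form(uniform))
```

The log form is usually the safer one to compute in practice, since the raw product of many small probabilities can underflow for long sequences.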