Generalization in Classification

https://d2l.ai/chapter_linear-classification/generalization-classification.html


Typo: "popullation" should be "population".

Fix the numbering of the list after "A number of burning questions demand immediate attention:". The items all render as "1" instead of "1", "2", "3".

Can someone help me understand the problem of false discovery discussed in Section 4.6.2? I am unable to understand why using the same test set with multiple classifiers/models is a problem. As long as hyperparameter tuning is done on a validation set, what is the problem with benchmarking the performance of different models on the same test set? Similarly, what exactly is adaptive overfitting, and how is it different from the problem of false discovery?

“No such a uniform convergence result could possibly hold.”

Addition about Hoeffding’s inequality

When I was learning this chapter, what confused me was the formula. How do you get comfortable with a new formula? One of the best ways is to work through an example with concrete numbers.
The formula is below:
$$P(\epsilon_\mathcal{D}(f) - \epsilon(f) \geq t) < \exp\left( - 2n t^2 \right).$$
Condition: 95% confidence that the distance between our estimate and the true error rate does not exceed 0.01.
JavaScript code to calculate the required n:

Math.log(0.05) / (-2 * 0.01 * 0.01)  // ≈ 14978.66, so n ≈ 14979 samples

Here 0.05 is the value on the right-hand side of the inequality, computed as 1 − 95% (95% being the confidence level), and 0.01 is t.
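
To make the arithmetic explicit (a small derivation of my own, not from the chapter), solve the inequality for n with δ = 0.05 and t = 0.01:

$$\exp(-2nt^2) \leq \delta \iff n \geq \frac{\ln(1/\delta)}{2t^2} = \frac{\ln(1/0.05)}{2 \cdot 0.01^2} \approx 14978.66,$$

so roughly 14979 test samples are needed.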

My solutions to the exercises of 4.6

I think the way it works is: we're saying we are 95% confident that we've produced a model that generalizes well, i.e., the error rate observed on the test data is very close to the true error rate we would get if we were somehow able to test against every data point that exists. Every time we produce a model, we take a gamble that our model will NOT be part of that 5% faulty population. The more models we produce, the more likely it is that we'll get one that is faulty.
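
As a rough sketch of this gamble (my own toy simulation, not from the chapter; the values p, n, t below are made up for illustration): every "model" here has the same true error rate, so any model that looks noticeably better on the shared test set is a false discovery.

```python
import numpy as np

# Toy simulation: all "models" share the same true error rate p, so a model
# that appears >= t better than p on the test set only looks good by luck.
rng = np.random.default_rng(0)
p, n, t, trials = 0.2, 1000, 0.02, 2000  # illustrative values only

for m in (1, 10, 100):
    # errors[i, j] = empirical test error of model j in trial i
    errors = rng.binomial(n, p, size=(trials, m)) / n
    # fraction of trials where at least one model beats the truth by >= t
    lucky = np.mean(errors.min(axis=1) <= p - t)
    print(f"m={m:3d} models: P(at least one looks >= {t} better) ≈ {lucky:.2f}")
```

The more models we score on the same test set, the more often at least one of them looks better than it really is.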

I would like to add some extra information on multiple-model comparison. In statistical learning, when we perform multiple hypothesis tests, if we set a threshold α = 0.05 for the p-value and test each hypothesis individually at level α, we no longer have a false positive rate of 0.05 overall. In this case we can define the family-wise error rate (FWER) as the probability of making at least one false discovery (false positive). It turns out that for m hypothesis tests, FWER = 1 − (1 − α)^m; with threshold 0.05 and m = 20, we have a 0.64 probability of making at least one false discovery. If you are interested in the statistical learning perspective, you can refer to the last chapter of "An Introduction to Statistical Learning" and check out the Bonferroni, Holm's step-down, and Tukey's methods for controlling FWER, or the Benjamini-Hochberg procedure for controlling FDR.
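
For a quick numerical check of the FWER formula above (a tiny sketch; the Bonferroni correction of testing each hypothesis at α/m is one of the methods just mentioned):

```python
# FWER = 1 - (1 - alpha)^m for m independent tests at per-test level alpha,
# plus the Bonferroni correction, which tests each hypothesis at alpha/m.
alpha, m = 0.05, 20

fwer = 1 - (1 - alpha) ** m
print(f"FWER at per-test alpha={alpha} with m={m} tests: {fwer:.2f}")  # ~0.64

bonferroni_alpha = alpha / m
fwer_bonferroni = 1 - (1 - bonferroni_alpha) ** m
print(f"FWER with Bonferroni threshold alpha/m={bonferroni_alpha}: {fwer_bonferroni:.3f}")  # ~0.049
```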