Environment and Distribution Shift


Hi @goldpiggy, while going through the section, I found it mentioned that this case involved covariate shift:

As we explained to them, it would indeed be easy to distinguish between the healthy and sick cohorts with near-perfect accuracy. However, that is because the test subjects differed in age, hormone levels, physical activity, diet, alcohol consumption, and many more factors unrelated to the disease. This was unlikely to be the case with real patients. Due to their sampling procedure, we could expect to encounter extreme covariate shift.

However, according to the definition of label shift, such cases are said to be examples of label shift:

For example, we may want to predict diagnoses given their symptoms (or other manifestations), even as the relative prevalence of diagnoses are changing over time. Label shift is the appropriate assumption here because diseases cause symptoms

Could you please let me know what I am missing here? Thanks.

Hi @kusur, great question! In covariate shift, we focus on the shift in the distribution of the “data” or “features”. For example, age, hormone levels, physical activity, diet, alcohol consumption, and many more factors ... are all features of a sample point. In label shift, we focus on the shift in the distribution of the “label”, such as the disease diagnoses. See this paper for more details: https://arxiv.org/pdf/1802.03916.pdf.

@goldpiggy “If our classifier is sufficiently accurate to begin with, then the confusion matrix 𝐂 will be invertible, and we get a solution” Any idea why this is true? Is there any mathematical intuition behind it?

In section

“What we can do, however, is average all of our models predictions at test time together, yielding the mean model outputs 𝜇(𝐲̂ )”

What does an average of all of the models refer to here? I was assuming we only have one model (after evaluating the training set)

Hi @sushmit86,
If the classifier is decently accurate, then the diagonal entries of the confusion matrix (where correct predictions are counted) will be large.

Ex. Confusion Matrix of a 2-Class Classifier with Decent Accuracy

|  20  |   2  |
|   5  |  30  |

Recall that a square matrix is invertible as long as its determinant is non-zero. Our determinant (20*30 - 2*5 = 590) is non-zero, so the matrix is invertible. From the example you can see that the determinant will not be zero unless you’re misclassifying about as many as you’re getting correct.
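A quick numerical check of the example, as a minimal sketch using NumPy. The second matrix `C_bad` and the helper `is_strictly_diagonally_dominant` are made up here for illustration. Strict diagonal dominance is a sufficient (not necessary) condition for invertibility, which is one way to make "large diagonals" precise.

```python
import numpy as np

# Confusion matrix from the example above (raw counts).
C = np.array([[20.0, 2.0],
              [5.0, 30.0]])

det = np.linalg.det(C)      # 20*30 - 2*5
C_inv = np.linalg.inv(C)    # exists because the determinant is non-zero

# Hypothetical classifier that is right about as often as it is wrong:
# the columns coincide and the determinant collapses to zero.
C_bad = np.array([[10.0, 10.0],
                  [10.0, 10.0]])
det_bad = np.linalg.det(C_bad)


def is_strictly_diagonally_dominant(M):
    """True if every diagonal entry exceeds the sum of the other
    absolute entries in its row (sufficient for invertibility)."""
    diag = np.abs(np.diag(M))
    off_diag = np.abs(M).sum(axis=1) - diag
    return bool(np.all(diag > off_diag))


print(det, det_bad, is_strictly_diagonally_dominant(C))
```

So an accurate classifier gives a diagonally dominant matrix, and that alone already guarantees an inverse.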

Here’s my mathematical intuition, hope it helps!

@six Thanks for the explanation

Exercises and my silly answers

  1. What could happen when we change the behavior of a search engine? What might the users do? What about the advertisers?

  • Users may switch to a search engine that suits them better; advertisers would try to adapt to the new engine.
  1. Implement a covariate shift detector. Hint: build a classifier.
  • How is that supposed to work? If we build a classifier and its accuracy is lower than some threshold, do we then say we have built a covariate shift detector?

I am trying to build a cat and dog detector. Here: https://www.kaggle.com/fanbyprinciple/pytorch-classification-of-dogs-vs-cats/

  1. Implement a covariate shift corrector.
  • Okay, cool. Will return to this; we need to create a random forest classifier.

Need to come back here after kaggle question.

  1. Besides distribution shift, what else could affect how the empirical risk approximates the risk?

  • Changing environment, time, geography, bias of the people using it, cultural assumptions.
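For exercises 2 and 3, one common reading of the hint "build a classifier" is to train a model that tells training inputs apart from test inputs, rather than looking at the accuracy of the task classifier. If that domain classifier does clearly better than chance (about 0.5), the two input distributions differ. Its logits can then be turned into importance weights exp(h(x)) for the training samples. The sketch below assumes equal-sized source and target samples, synthetic Gaussian data, a hand-rolled logistic regression, and a hypothetical weight cap `c`; none of these come from the thread.

```python
import numpy as np


def fit_logistic(X, s, lr=0.1, steps=2000):
    """Plain gradient-descent logistic regression; returns (w, b)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        prob = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        err = prob - s
        w -= lr * (X.T @ err) / len(s)
        b -= lr * err.mean()
    return w, b


def domain_classifier(X_src, X_tgt, seed=0):
    """Label source rows 0 and target rows 1, fit on one half of the
    pooled data, and report accuracy on the held-out half."""
    rng = np.random.default_rng(seed)
    X = np.vstack([X_src, X_tgt])
    s = np.concatenate([np.zeros(len(X_src)), np.ones(len(X_tgt))])
    idx = rng.permutation(len(X))
    half = len(X) // 2
    fit_idx, eval_idx = idx[:half], idx[half:]
    w, b = fit_logistic(X[fit_idx], s[fit_idx])
    pred = (X[eval_idx] @ w + b) > 0
    acc = float((pred == (s[eval_idx] == 1)).mean())
    return w, b, acc


# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(0)
X_src = rng.normal(0.0, 1.0, size=(1000, 2))      # "training" inputs
X_same = rng.normal(0.0, 1.0, size=(1000, 2))     # test inputs, no shift
X_shifted = rng.normal(1.0, 1.0, size=(1000, 2))  # test inputs, shifted mean

# Detector: held-out accuracy near 0.5 means no detectable shift.
_, _, acc_same = domain_classifier(X_src, X_same)
w, b, acc_shift = domain_classifier(X_src, X_shifted)

# Corrector: weight each training sample by exp(logit), capped at c.
c = 10.0  # hypothetical cap to keep single samples from dominating
beta = np.minimum(np.exp(X_src @ w + b), c)

# Sanity check: the weighted source mean should move toward the target mean.
weighted_mean = (beta[:, None] * X_src).sum(axis=0) / beta.sum()
print(acc_same, acc_shift, weighted_mean)
```

The weights `beta` would then multiply the per-sample losses when retraining the task model. How far above 0.5 the accuracy has to be before calling it a shift is a judgment call; a held-out split at least keeps the number honest.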

I am confused about the label shift correction and have doubts about its consequences. Here we simply weight the instances by p(y)/q(y), which means, if I am not mistaken, that we give more weight during training to the majority classes in the test set.

Doesn’t this favour the majority classes in the test set? If so, it will be detrimental to detecting minority classes. For example, in the medical domain, the cost of missing a disease is high, so it is important to detect it even at the price of many false positives. Meanwhile, disease classes are generally the minority. So, if we weight training instances using the test set distribution, minority classes, which may be the main target, might be adversely affected.
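A small worked example may help here. It is only a sketch with made-up priors, taking q(y) as the training label distribution and p(y) as the test label distribution. The weight depends on the ratio of the two, so a class that is still the minority at test time is upweighted whenever it is more common at test time than it was in training.

```python
import numpy as np

# Hypothetical class priors: index 0 = healthy, index 1 = disease.
q_y = np.array([0.90, 0.10])  # training (source) label distribution
p_y = np.array([0.70, 0.30])  # test (target) label distribution

# Per-class importance weight p(y) / q(y).
beta = p_y / q_y

# Per-sample weights for a toy batch of training labels.
y_batch = np.array([0, 1, 0, 0, 1])
sample_weights = beta[y_batch]

print(beta)            # weight for healthy, weight for disease
print(sample_weights)  # one weight per training example
```

With these numbers the disease class gets the larger weight even though it is the minority in both sets. The concern above would apply in the opposite case, when the disease is rarer at test time than in training; asymmetric misclassification costs are a separate issue that these weights do not address.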

The confusion matrix tells us which class (e.g., A) is easily misclassified; consequently, during later training, we should give more weight to the samples in class A.

Here are some of my poor thoughts on how to detect covariate shift.

Since covariate shift arises when the training dataset and the test dataset come from different distributions, maybe we can mix the two datasets into a new training dataset, train a new model, and evaluate the difference between the new model and the original one. If they differ, we can say the new training dataset must have a different distribution from our original training dataset.

But here is my question: how do we compare two models, and to what degree can we say they are different or the same?

To the authors:
Thanks for contributing this great resource! A comment on the section about distribution shift:
I appreciate that the book includes this topic in the introductory chapters. For your consideration: As the book has so many good references, I was expecting to see citations in this section too. I think that it’d be great to add citations here. Besides references specific to the methods, could you please add one or two general references, such as A unifying view on dataset shift in classification - ScienceDirect and/or a more recent review paper? Thanks!

Since the denominator in Eq 4.7.6 represents a mixture, should p(x) and q(x) be weighted so that their weights sum to 1 and the sum represents a density? In practice it wouldn’t matter, though, because the denominators cancel out.
Please let me know if I’m missing something. Thanks for the excellent exposition!

Equation 4.7.10 appears to be at odds with the definition of the confusion matrix in the text. Here is my derivation:

μ(y_hat) = E_x~p(x)[ f(x) ]
= ∫ f(x) p(x) dx
= ∫ f(x) [ ∫ p(x, y) dy ] dx
= ∫ f(x) [ ∫ p(x | y) p(y) dy ] dx
= ∫ [ ∫ f(x) q(x | y) dx ] p(y) dy    (label shift assumption: p(x | y) = q(x | y))
= C p(y)

Thus, the columns of C should each add up to 1, just like the stochastic matrix in the Markov chain. In the extreme case where the classifier is absolutely accurate, C should be the identity matrix (not something like diag(1/k, …, 1/k), according to the definition in the text), and p(y) = μ(y_hat). Am I missing something?
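To make the two normalizations concrete, here is a sketch with made-up numbers for a 3-class problem with hard predictions. `C_cond` is the column-normalized matrix from my derivation; `C_joint` is my reading of the definition in the text (each cell as a fraction of all validation predictions). If I have this right, solving with `C_cond` returns p(y) itself, while solving with `C_joint` returns the ratio p(y)/q(y).

```python
import numpy as np

# Hypothetical label distributions (assumptions for illustration).
q_y = np.array([0.5, 0.3, 0.2])  # source / validation labels
p_y = np.array([0.2, 0.3, 0.5])  # target / test labels

# Column-conditional confusion matrix:
# C_cond[i, j] = P(y_hat = i | y = j); each column sums to 1.
C_cond = np.array([[0.8, 0.1, 0.1],
                   [0.1, 0.8, 0.2],
                   [0.1, 0.1, 0.7]])

# Joint confusion matrix on source data:
# C_joint[i, j] = P_q(y_hat = i, y = j) = C_cond[i, j] * q(y = j).
C_joint = C_cond * q_y  # broadcasting scales column j by q_y[j]

# Mean model output on the target data under label shift.
mu = C_cond @ p_y

p_from_cond = np.linalg.solve(C_cond, mu)    # recovers p(y)
w_from_joint = np.linalg.solve(C_joint, mu)  # recovers p(y) / q(y)

print(p_from_cond)
print(w_from_joint, p_y / q_y)
```

Either way the importance weights come out the same once the first result is divided by q(y), so the difference is in what the linear system is said to solve for, not in the final correction.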

My solutions to the exs: 4.7

I think can go before 4.6, as it defines risk and empirical risk used in 4.6.