41 replies
Jun '20

pr2tik1

If we add hidden layers and non-linearity, will it still be linear regression? Why does an MLP with a hidden layer perform well on “linear regression”?

1 reply
Jun '20 ▶ pr2tik1

goldpiggy

Hey @pr2tik1, it won’t be linear regression if we add non-linearity. But statistically we still call “logistic regression” a “generalized linear model”, since its output depends on a linear combination of the model parameters (the betas), rather than on products or quotients of them.
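
(For example, logistic regression predicts $\hat{y} = \sigma(\mathbf{x}^\top \boldsymbol{\beta})$: a fixed nonlinearity applied to a single linear combination of the parameters $\boldsymbol{\beta}$, which is why it still counts as a generalized linear model. An MLP stacks further linear maps and nonlinearities on top of hidden layers, so it no longer has that form.)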

Jun '20

pr2tik1

Thanks for the reply, @goldpiggy!

Jun '20

pr2tik1

I also have a doubt regarding the normalization of the values. Here the target variable SalePrice is also normalized, which makes its values range over roughly (-5, 5). When submitting predictions to Kaggle, what step is taken to get back to the desired range of SalePrice values (like 10,000 or 20,000)?

1 reply
Jun '20 ▶ pr2tik1

goldpiggy

SalePrice is the label, and we are not normalizing the label values.
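
For reference, if you did standardize the label yourself, you would need to undo that transform before submitting. A minimal sketch with made-up numbers (not the book's code):

import numpy as np
# hypothetical statistics saved before standardizing the training labels
label_mean, label_std = 180000.0, 79000.0       # illustrative values only
preds_std = np.array([-0.5, 0.2, 1.3])          # model outputs on the standardized scale
preds_dollars = preds_std * label_std + label_mean
print(preds_dollars)                            # back on the original SalePrice scale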

Jun '20

MorrisXu-Driving

There is a bug in the train_and_pred() function: test_data['SalePrice'] = pd.Series(preds.reshape(1, -1)[0]) only returns the first prediction. To correct the error, we need to first flatten preds by adding .flatten() after .numpy() and deleting the .reshape(1, -1)[0].

1 reply
Jun '20 ▶ MorrisXu-Driving

goldpiggy

Hi @MorrisXu-Driving, thanks for your suggestion!

I agree that flatten() is a more elegant way, but it seems these two methods give equivalent results. Does any other bug exist here?
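
A quick check of the equivalence on a toy prediction column (a sketch, not the book's data):

import numpy as np
preds = np.array([[100000.], [200000.], [300000.]])   # predictions as a column, shape (3, 1)
print(preds.reshape(1, -1)[0])   # [100000. 200000. 300000.] -- all predictions, not just the first
print(preds.flatten())           # same result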

Jun '20

Dchaoqun

As far as I understood, the model should have no knowledge of the test data. Therefore,
for data preprocessing, shouldn’t we use only the training data to calculate the mean and variance,
then use them to rescale the test data?

1 reply
Jun '20 ▶ Dchaoqun

goldpiggy

Hi @Dchaoqun, great question! The normalization step here is to put all the features on the same scale, rather than having one feature in the range [0, 0.1] and another in [-1000, 1000]. The latter case may lead to overly sensitive weight parameters. But you are right: in a real-life scenario we may not know the test data at all, so we assume the test and train sets follow similar feature distributions.
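
A minimal sketch of the train-only variant on toy data (not the book's code):

import pandas as pd
# toy numeric feature tables standing in for the real train/test splits
train_num = pd.DataFrame({'LotArea': [8450., 9600., 11250.]})
test_num = pd.DataFrame({'LotArea': [11622., 14267.]})
mu, sigma = train_num.mean(), train_num.std()   # statistics from the training split only
train_num = (train_num - mu) / sigma
test_num = (test_num - mu) / sigma              # rescale the test split with the training statistics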

1 reply
Jun '20 ▶ goldpiggy

Dchaoqun

Hi @goldpiggy, thank you for the clarification!

Jul '20

S_X

Hi @goldpiggy,
In the train function, why is MSEloss used as the loss function instead of log_rmse?

2 replies
Jul '20 ▶ S_X

goldpiggy

Hey @S_X, good question! That’s why we leave it as an exercise in question 2. :wink:

1 reply
Aug '20 ▶ S_X

Andong

I guess maybe the gradient will be too small, due to the derivative of $(\log y - \log \hat{y})^2$. I don’t know if I am right.
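
(For what it’s worth, a quick derivation: the per-example gradient is $\frac{\partial}{\partial \hat{y}} (\log \hat{y} - \log y)^2 = \frac{2(\log \hat{y} - \log y)}{\hat{y}}$, so the $1/\hat{y}$ factor makes the gradients very small when the predictions are on the scale of raw house prices, roughly $10^5$.)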

Aug '20

Nish

A minor point, but in the PyTorch definition of log_rmse:

def log_rmse(net, features, labels):
    # To further stabilize the value when the logarithm is taken, set the
    # value less than 1 as 1
    clipped_preds = torch.clamp(net(features), 1, float('inf'))
    rmse = torch.sqrt(torch.mean(loss(torch.log(clipped_preds),
                                       torch.log(labels))))
    return rmse.item()

Isn’t the torch.mean() call unnecessary, since loss is already the Mean Squared Error, i.e. the mean is already taken?

1 reply
Aug '20 ▶ goldpiggy

Nish

I tried to predict the logs of the prices instead, using the following loss function:

def log_train(preds, labels):
    clipped_preds = torch.clamp(preds, 1, float('inf'))
    rmse = torch.mean((torch.log(clipped_preds) - torch.log(labels)) ** 2)
    return rmse

And the results were much worse. During the k-fold validation step, some of the folds had much higher validation/training losses than the others:

And sometimes the plot of losses didn’t appear to descend with epochs at all (plot not shown).

I suppose this is the point of the exercise, to show that it’s a bad idea, but I’m having trouble understanding why. It seems that instead of trying to minimise some concept of absolute error we’re trying to minimise a concept of relative (percentage) error between the prediction and the reality. Why would this lead to such instability?

Edit: I have an idea. In order to maintain numerical stability we have to clamp the predictions:

clipped_preds = torch.clamp(preds, 1, float('inf'))

But if the network parameters are initialised so that all of the initial predictions are below 1 (as is what I observed debugging one run) then they could all get clamped in this way and backprop would fail as the gradients are zero/meaningless?

1 reply
Aug '20

goldpiggy

Hey @Nish, great question! Actually, using log_rmse may not be a bad idea. I guess you only changed the loss function but not the other hyperparameters such as “lr” and “epoch”. Try a smaller “lr”, such as 1, and a larger “epoch”, such as 1000. What’s more, the folds with a high loss like 12 here might result from bad initialization; you can try net.initialize(init=init.Xavier()) (more details here).
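
Since this is the PyTorch thread: net.initialize(init=init.Xavier()) is the MXNet call. A minimal PyTorch sketch of the same idea (illustrative layer sizes, not the book's exact model):

import torch.nn as nn

net = nn.Sequential(nn.Linear(331, 256), nn.ReLU(), nn.Linear(256, 1))

def init_xavier(module):
    if isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)
        nn.init.zeros_(module.bias)

net.apply(init_xavier)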

1 reply
Aug '20 ▶ goldpiggy

Nish

Thank you @goldpiggy! So it sounds like my final point could be correct: the issues came from bad initialisation, so all the initial predictions get clamped to 1 and the gradient is meaningless?

1 reply
Aug '20 ▶ Nish

goldpiggy

Hey @Nish, you got the idea! Initialization and learning rate are crucial to neural networks. If you read further into advanced HPO in later chapters, you will find learning rate schedulers. Keep it up!

1 reply
Aug '20 ▶ goldpiggy

StevenJokes

Hi, @goldpiggy

Analogy:

Oct '20

swg104

I don’t get why we need torch.mean in the function above. Doesn’t the loss, which is MSELoss, already take the mean? Why do we need to take the mean again?

1 reply
Oct '20

goldpiggy

Hi @swg104, great catch. Would you like to post a PR and become a contributor?
(However, since the final loss is divided by “n” twice, it won’t affect the weight optimization.)

Oct '20

swg104

SG https://github.com/d2l-ai/d2l-en/pull/1485

1 reply
Oct '20 ▶ swg104

goldpiggy

Thanks @swg104. Feel free to open a PR if anything else doesn’t look right. We appreciate your effort in supporting the community! :smiley:

Apr '21

HtC

In the data preprocessing, why do we standardize the data before replacing missing values with the corresponding feature’s mean? Shouldn’t it be the opposite, as stated in the text?
Thanks

1 reply
Apr '21 ▶ HtC

StevenJokess

@HtC
You will find the two orders have the same effect.
The reason we replace missing values with the corresponding feature’s mean is to keep the overall mean and variance the same as before.
:upside_down_face:

1 reply
Apr '21 ▶ StevenJokess

HtC

Thanks, I know the reason why we replace missing values with the mean…
I just didn’t think the reverse order would lead to the same result :sweat_smile:

1 reply
Apr '21 ▶ HtC

StevenJokess

I just realized it isn’t the same.
Filling the missing values with the mean first makes the computed variance smaller, because the denominator (the number of values) gets bigger while the filled entries add no deviation.
So the standardized values will be larger in magnitude if we replace the missing values before standardizing the data.

1 reply
Apr '21 ▶ StevenJokess

HtC

Yes, I was just looking at the same thing.
Hence, what is the correct order?
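
A quick numerical check of the two orders on a toy column (a sketch, not the book's code):

import pandas as pd
x = pd.Series([1.0, 2.0, 3.0, None])

# Order A: fill the missing value with the mean first, then standardize
a = x.fillna(x.mean())
a = (a - a.mean()) / a.std()

# Order B (what the book's code effectively does): standardize on the observed values, then fill with 0
b = (x - x.mean()) / x.std()   # pandas skips NaN when computing mean and std
b = b.fillna(0)

print(a.values)   # [-1.22  0.    1.22  0.  ] -- divided by a smaller std, so larger magnitudes
print(b.values)   # [-1.  0.  1.  0.]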

Aug '21

fanbyprinciple

Exercises and my silly answers

  1. Submit your predictions for this section to Kaggle. How good are your predictions?
  2. Can you improve your model by minimizing the logarithm of prices directly? What happens if you try to predict the logarithm of the price rather than the price?
  3. Is it always a good idea to replace missing values by their mean? Hint: can you construct a situation where the values are not missing at random?
  4. Improve the score on Kaggle by tuning the hyperparameters through K-fold cross-validation.
  5. Improve the score by improving the model (e.g., layers, weight decay, and dropout).
  6. What happens if we do not standardize the continuous numerical features like what we have done in this section?

Aug '21

CE_I

Only changing the hyperparameters batch_size=64 and lr=10 can achieve a result of 0.14826.

When more layers are added, although the training loss is reduced, it always causes over-fitting; whether you use dropout or weight decay, it is unavoidable. I think this data set is too simple and not suitable for deep networks.


I don’t know if anyone has used a deeper network to achieve better results; please share! :grinning:

Mar '22

kamille

Hi, I wonder whether we need to initialize the weights and bias before training the model, like using:

net[0].weight.data.normal_(0, 0.01)
net[0].bias.data.fill_(0)
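
For what it's worth, nn.Linear already applies a default initialization when it is constructed, so an explicit step like the above is optional unless you want a specific scheme. A quick check (a sketch):

import torch.nn as nn
layer = nn.Linear(8, 1)
print(layer.weight)   # already non-zero: PyTorch's default initialization has been applied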

Nov '22

Xuqt1918

Is the data preprocessing in the Kaggle house price prediction section wrong? The text says “we apply a heuristic, replacing all missing values by the corresponding feature’s mean,” but the code comment says “Replace NAN numerical features by 0.” Is it better to choose the mean substitution method?
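
If it helps, in the book's code the NaNs are filled after standardization, and the standardized observed values have mean 0, so filling with 0 there amounts to mean substitution anyway. A quick check on a toy column (a sketch, not the book's code):

import pandas as pd
x = pd.Series([1.0, 2.0, 3.0, None])
z = (x - x.mean()) / x.std()       # pandas skips NaN when computing mean and std
print(z.fillna(0).values)          # [-1.  0.  1.  0.]
print(z.fillna(z.mean()).values)   # same result: 0 is the mean of the standardized observed values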

Apr '23

keeper-jie

Question: a code test gives $E[(x-\mu)^2] \neq \sigma^2$

According to the formula below:
$E[(x-\mu)^2] = E[x^2] - 2\mu E[x] + \mu^2 = (\sigma^2 + \mu^2) - 2\mu^2 + \mu^2 = \sigma^2$

When I use code to test it, the result is not equal, so I am very confused.

a=torch.tensor([1.,2.,3.])
a_mean=a.mean()
print(a_mean) # tensor(2.)

a_var=a.var() # formula: Var(x)=Sum( (x-E(x))**2 )/(n-1)
print(a_var) # tensor(1.)

a_std=a.std()
print(a_std) # tensor(1.)

b=(a-a_mean)**2
print(b)
print(b.mean()) # tensor(0.6667)
print(b.mean()==a_std**2) # tensor(False) ??? why the result not equal
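
A likely explanation: torch.var and torch.std use the unbiased estimator by default, dividing by n - 1, while b.mean() divides by n. With the biased estimator the two agree (a sketch, continuing from the snippet above; newer PyTorch spells the flag correction=0):

print(a.var(unbiased=False))                            # tensor(0.6667), dividing by n
print(torch.isclose(b.mean(), a.var(unbiased=False)))   # tensor(True)
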
May '23

Evanns_Morales

I’m getting the following error when I simply run the code as-is from the tutorial. Can anyone help me understand?

3 replies
May '23 ▶ Evanns_Morales

alexwonder

Hey, I ran into the same error. Do you know how to resolve it?

Aug '23

cclj

In the data preprocessing section, it seems that the code in the book simply standardized the numerical features with mu and sigma computed on a concatenated dataset. Wouldn’t this cause information leak?

Aug '23 ▶ Evanns_Morales

Richard_Alex

Because get_dummies generates bool values. Try this after get_dummies:

features = pd.get_dummies(features, dummy_na=True)
features *= 1
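
If your pandas version supports it, passing dtype to get_dummies may avoid the bool columns directly (a sketch):

features = pd.get_dummies(features, dummy_na=True, dtype=float)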

Aug '23

pandalabme

My solutions to the exs: 5.7

Sep '23

Silin_Li

In the code below, I wonder how I can get the hash value?

class KaggleHouse(d2l.DataModule):
    def __init__(self, batch_size, train=None, val=None):
        super().__init__()
        self.save_hyperparameters()
        if self.train is None:
            self.raw_train = pd.read_csv(d2l.download(
                d2l.DATA_URL + 'kaggle_house_pred_train.csv', self.root,
                sha1_hash='585e9cc93e70b39160e7921475f9bcd7d31219ce'))
            self.raw_val = pd.read_csv(d2l.download(
                d2l.DATA_URL + 'kaggle_house_pred_test.csv', self.root,
                sha1_hash='fa19780a7b011d9b009e8bff8e99922a8ee2eb90'))
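
If you mean computing the SHA-1 for a file yourself, one way is the standard library's hashlib (a sketch, assuming the CSV has already been downloaded locally):

import hashlib

def sha1_of(path):
    h = hashlib.sha1()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            h.update(chunk)
    return h.hexdigest()

print(sha1_of('kaggle_house_pred_train.csv'))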

15 Jan ▶ Evanns_Morales

wtffqbpl

The methods for converting np.ndarray to torch.tensor may have changed, so I use another method to convert np.ndarray to torch.tensor.

        # cast to float32 so the feature and label tensors match the model's parameter dtype
        train_features = torch.from_numpy(all_features[:n_train].values.astype('float32'))
        test_features = torch.from_numpy(all_features[n_train:].values.astype('float32'))
        train_labels = torch.from_numpy(self.train_data.SalePrice.values.reshape(-1, 1)).to(dtype=torch.float32)
14 Feb

zhang2023-byte

my exercise:

  1. Average validation log MSE = 0.182, score = 0.41115 for the textbook's naive linear regression.
  2. No. Selection effect: house data on one side of the price distribution may be hard to collect.
  3. Tuned max_epochs=20, other hyper-parameters fixed: log MSE = 0.12, score = 0.34241.
     Tuned max_epochs=50: log MSE = 0.068, score = 0.26531; increasing training did increase the model's performance.
     Or tuned lr=0.03, max_epochs=30: log MSE = 0.057, score = 0.22289, but too large an lr will decrease the model's performance.
  4. MLP with one hidden layer (num_hiddens=256), lr=0.002, max_epochs=100: log MSE = 0.078, score = 0.27778, better than linear regression with the same lr * max_epochs.
     Add dropout: dropout=0.5, log MSE = 0.0944, score = 0.2935, meaning our model still underfits?
     Add weight decay: wd=0.1, log MSE = 0.0797, score = 0.27613.
  5. Not all values are positive, so we can't take the log.