Predicting House Prices on Kaggle

Nish · August 6, 2020, 1:59pm

I tried to predict the logs of the prices instead, using the following loss function:

def log_train(preds, labels):
    clipped_preds = torch.clamp(preds, 1, float('inf'))
    rmse = torch.mean((torch.log(clipped_preds) - torch.log(labels)) ** 2)
    return rmse

And the results were much worse. During the k-fold validation step, some of the folds had much higher validation/training losses than the others:

And sometimes the plot of losses didn’t appear to descend with epochs at all:

I suppose this is the point of the exercise, to show that it’s a bad idea, but I’m having trouble understanding why. It seems that instead of trying to minimise some concept of absolute error we’re trying to minimise a concept of relative (percentage) error between the prediction and the reality. Why would this lead to such instability?

Edit: I have an idea. In order to maintain numerical stability we have to clamp the predictions:

clipped_preds = torch.clamp(preds, 1, float('inf'))

But if the network parameters are initialised so that all of the initial predictions are below 1 (as is what I observed debugging one run) then they could all get clamped in this way and backprop would fail as the gradients are zero/meaningless?

goldpiggy · August 6, 2020, 5:57pm

Hey @Nish, great question! Actually using log_rmse may not be a bad idea. I guess you only change the loss function but not other hyperparameters such as “lr” and “epoch”. Try a smaller “lr” such as 1, and a larger “epoch” such as 1000. What is more, the folds with high loss as 12 here might result from bad initialization, you can try net.initialize(init=init.Xavier() and more details here.)

Nish · August 7, 2020, 1:44pm

Thank you @goldpiggy ! So it sounds like my final point could be correct - that the issues came from bad initialisation so that all the initial predictions get clamped to 1 and the gradient is meaningless?

goldpiggy · August 7, 2020, 11:52pm

Hey @Nish, you got the idea! Initialization and learning rate are crucial to neural network. If you read further into advanced HPO in later chapters, you will find learning rate scheduler. Keep up!

StevenJokes · August 8, 2020, 1:47am

Hi， @goldpiggy

Analogy：

Initialization is more like talent?
lr is more like efforts to understand world?

swg104 · October 12, 2020, 8:27pm

Nish:

def log_rmse(net, features, labels):
    # To further stabilize the value when the logarithm is taken, set the
    # value less than 1 as 1
    clipped_preds = torch.clamp(net(features), 1, float('inf'))
    rmse = torch.sqrt(torch.mean(loss(torch.log(clipped_preds),
                                       torch.log(labels))))
    return rmse.item()

I don’t get why we need torch.mean in the function above. Isn’t the loss, which is MSELoss has the “mean” already? Why do we need to do the mean again?

goldpiggy · October 15, 2020, 4:25am

Hi @swg104, great catch. Would like to post a PR and be a contributor?
(However, since the final loss is divided by “n” double times, it won’t affect the weights optimization.)

swg104 · October 15, 2020, 2:40pm

SG https://github.com/d2l-ai/d2l-en/pull/1485

goldpiggy · October 16, 2020, 5:07am

Thanks @swg104. Feel free to PR if anything else doesn’t look right. We appreciate you effort to promote the community!

HtC · April 7, 2021, 3:30pm

In Data preprocessing, why do we standarize the data before replacing missing values with the corresponding features’s mean? Shouldn’t be the opposite, as stated into the text?
Thx

StevenJokess · April 7, 2021, 3:42pm

@HtC
You will find the orders have same effect.
The reason why we replace missing values with the corresponding features’s mean is to keep the whole’s mean and variance same with the mean and variance before.

HtC · April 7, 2021, 4:08pm

Thx, I know the reason why we replace missing values with the mean…
I didn’t think the reverse order leads to the same result

StevenJokess · April 7, 2021, 4:12pm

I just realized it isn’t same.
Replacing missing values will make variance smaller than before, because the denominator is bigger.
So replacing missing values before standarizing the data will be bigger.

HtC · April 7, 2021, 4:48pm

Yes, I was just looking at the same thing.
Hence, what is the correct order?

fanbyprinciple · August 9, 2021, 2:04am

Exercises and my silly answers

Submit your predictions for this section to Kaggle. How good are your predictions?

I got a rmse value of 0.16703

Can you improve your model by minimizing the logarithm of prices directly? What happens

if you try to predict the logarithm of the price rather than the price?

The log rmse values are coming as Nan

Is it always a good idea to replace missing values by their mean? Hint: can you construct a

situation where the values are not missing at random?

It might be the case where the date data is given and dates in between are missing in that case we can put in the date

Improve the score on Kaggle by tuning the hyperparameters through K-fold crossvalidation.

I have improved it once.

Improve the score by improving the model (e.g., layers, weight decay, and dropout).

tried adding more layers but accuracy decreased.

What happens if we do not standardize the continuous numerical features like what we have

done in this section?

improper features size that are different for different features.

CE_I · August 11, 2021, 9:19am

Only changing the hyperparameters batch_size=64 and lr=10 can achieve a result of 0.14826.

When the layer is added, although the training loss will be reduced, it will always cause over-fitting. No matter whether you use Dropout or weight attenuation, it is unavoidable.I think this data set is too simple and not suitable for deep networks.

捕获1

I don’t know if anyone has used the deep network to achieve better results, please share!

kamille · March 19, 2022, 2:02pm

Hi I wonder do we need to initialize the weight and bias before train the model? like using net[0].weight.data.normal_(0, 0.01)
net[0].bias.data.fill_(0)

Xuqt1918 · November 19, 2022, 8:30am

Is the section on Kaggle prediction house price data preprocessing wrong? In the article it says “we apply a heuristic, replacing all missing values by the corresponding feature’s mean.” but in the code it says “Replace NAN numerical features by 0”. Is it better to choose the mean substitution method?

keeper-jie · April 8, 2023, 2:39am

Question: Code test $E[(x-\mu)^2]$ != $\sigma^2$

According the formula below:
$E[(x-\mu)^2] = (\sigma^2 + \mu^2) - 2\mu^2+\mu^2 = \sigma^2$

When I use code to test it, the result is not equal, so I am very confused.

a=torch.tensor([1.,2.,3.])
a_mean=a.mean()
print(a_mean) # tensor(2.)

a_var=a.var() # formula: Var(x)=Sum( (x-E(x))**2 )/(n-1)
print(a_var) # tensor(1.)

a_std=a.std()
print(a_std) # tensor(1.)

b=(a-a_mean)**2
print(b)
print(b.mean()) # tensor(0.6667)
print(b.mean()==a_std**2) # tensor(False) ??? why the result not equal

Evanns_Morales · May 9, 2023, 3:15am

I’m getting the following error when I simply run the code as-is from the tutorial. Can anyone help me understand?