Dropout

Hi AdaV, when I implemented it, that did seem to be the case. But I am not sure of the veracity of my claims. I guess I am the most unreliable person on this chat! XD

Since this is my first post, I was not allowed to post any embedded content. I wrote up a quick set of notes here-

I would love some guidance on question 3. How might we visualize or calculate the activations, or the variance of the activations, of hidden-layer units?
Thanks
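
One way I can think of, if it helps: register a forward hook on the hidden layer and look at the recorded activations. Below is a minimal sketch; the model, the layer index, and the batch shape are just illustrative stand-ins, not the book's code.

    import torch
    from torch import nn

    # Illustrative MLP in the style of this section.
    net = nn.Sequential(
        nn.Flatten(),
        nn.LazyLinear(256), nn.ReLU(), nn.Dropout(0.5),
        nn.LazyLinear(10))

    activations = []

    def record(module, inputs, output):
        # Save the post-ReLU hidden activations of the current batch.
        activations.append(output.detach())

    # Index 2 in the Sequential above is the ReLU after the hidden layer.
    net[2].register_forward_hook(record)

    X = torch.randn(256, 1, 28, 28)  # stand-in for a Fashion-MNIST batch
    net(X)

    h = activations[-1]          # shape: (batch_size, num_hiddens)
    print(h.var(dim=0))          # per-unit variance across the batch
    print(h.var(dim=0).mean())   # a single summary number to compare settings

From there you could compare the mean per-unit variance with and without dropout, or plot a histogram of h to visualize the activations.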

I am confused by the last line of Sec. 5.6:

By design, the expectation remains unchanged, i.e., E[h'] = h

Is that correct, or should it be E[h'] = E[h]?
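
For what it's worth, here is the expectation written out from the inverted-dropout definition earlier in the section (the expectation is over the random dropout mask, with h held fixed):

    % Inverted dropout as defined in the section:
    h' =
    \begin{cases}
      0 & \text{with probability } p, \\
      \dfrac{h}{1-p} & \text{with probability } 1-p,
    \end{cases}
    \qquad
    \mathbb{E}[h' \mid h] = p \cdot 0 + (1-p) \cdot \frac{h}{1-p} = h.

Read that way, both statements are consistent: E[h' | h] = h, and taking the expectation over h on both sides gives E[h'] = E[h].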

Exercise 6:
Dropping out one row of W(2) at a time is equivalent to applying dropout to the hidden layer.
Dropping out one column of W(2) at a time is equivalent to applying dropout to the output layer.
Dropping out entries of W completely at random probably leads to slower convergence.
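
A quick numerical check of the first claim, assuming the h W(2) + b(2) convention used in the chapter (the shapes and values below are made up, and the 1/(1-p) rescaling is left out):

    import torch

    torch.manual_seed(0)
    h = torch.randn(4, 5)    # hidden activations: (batch_size, num_hiddens)
    W2 = torch.randn(5, 3)   # second-layer weights: (num_hiddens, num_outputs)

    i = 2  # index of the hidden unit / row to drop

    # (a) zero out row i of W2
    W2_dropped = W2.clone()
    W2_dropped[i, :] = 0
    out_a = h @ W2_dropped

    # (b) zero out hidden unit i, i.e. dropout on the hidden layer
    h_dropped = h.clone()
    h_dropped[:, i] = 0
    out_b = h_dropped @ W2

    print(torch.allclose(out_a, out_b))  # True: the two operations match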

My solutions to the exercises of Sec. 5.6:

Hi,
I don't understand the point of this: X.reshape((X.shape[0], -1))
It seems to just reshape X into the same shape it already has.
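
For what it's worth, a quick check of when that reshape matters, assuming Fashion-MNIST-shaped inputs (the shapes below are my own example, not the book's):

    import torch

    # For an already-flat matrix the reshape is indeed a no-op:
    X = torch.randn(256, 784)
    print(X.reshape((X.shape[0], -1)).shape)   # torch.Size([256, 784])

    # For a raw image batch it flattens each example into a vector,
    # which is the shape the fully connected layers expect:
    X = torch.randn(256, 1, 28, 28)
    print(X.reshape((X.shape[0], -1)).shape)   # torch.Size([256, 784])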

My exercise answers:

  1. Decreasing dropout: I didn't see any change in the results;
    increasing dropout: val_acc decreases significantly when dropout > 0.9.

  2. Without dropout: I see sudden decreases and increases of val_acc as the epochs go on. Is this a sign of overfitting? Double descent?

  3. I guess the variance will increase after dropout is applied.

  4. I think applying dropout at test time will decrease the model's performance, and there is no benefit to doing so.

  5. I find that adding weight decay reduced the performance of my MLP; the performance ranking is MLP+WD < MLP+WD+dropout < MLP+dropout. Is this because WD impaired the expressive power of the MLP? Below is my code, and a minimal usage sketch follows after item 7; I'm not sure if it's correct:
    import torch
    from torch import nn
    from d2l import torch as d2l

    class WD_DropOutMLP(d2l.Classifier):
        """MLP with dropout; weight decay is applied to the weights only."""
        def __init__(self, num_outputs, num_hiddens_1, num_hiddens_2,
                     dropout_1, dropout_2, lr, wd):
            super().__init__()
            self.save_hyperparameters()  # stores lr, wd, etc. as attributes
            self.net = nn.Sequential(
                nn.Flatten(),
                nn.LazyLinear(num_hiddens_1), nn.ReLU(), nn.Dropout(dropout_1),
                nn.LazyLinear(num_hiddens_2), nn.ReLU(), nn.Dropout(dropout_2),
                nn.LazyLinear(num_outputs))

        def configure_optimizers(self):
            # Apply weight decay to the weight matrices but not to the biases.
            params = list(self.net.named_parameters())
            weight_params = [param for name, param in params if 'weight' in name]
            bias_params = [param for name, param in params if 'bias' in name]
            return torch.optim.SGD([
                {'params': weight_params, 'weight_decay': self.wd},
                {'params': bias_params}], lr=self.lr)

  6. TBD

  7. TBD
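
Re: item 5, a minimal usage sketch for the class above, assuming the d2l.FashionMNIST data class and d2l.Trainer used elsewhere in the book (the hyperparameter values are arbitrary):

    from d2l import torch as d2l

    # Arbitrary hyperparameters, just to show how the class is wired up.
    model = WD_DropOutMLP(num_outputs=10, num_hiddens_1=256, num_hiddens_2=256,
                          dropout_1=0.5, dropout_2=0.5, lr=0.1, wd=1e-4)
    data = d2l.FashionMNIST(batch_size=256)
    trainer = d2l.Trainer(max_epochs=10)
    trainer.fit(model, data)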