Thanks again for the reply. However, I still have some follow-up questions that are unclear to me:
Without dropout in your example, mostly only 2 of the 4 neurons take effect. In other words, the model may have less representational power (due to the smaller effective capacity of the model) without dropout. So my understanding from your example is: a model with more capacity should be more stable than a model with less capacity. Is my statement correct?
I do not think A & B will be too “lazy” to adjust their weights. Assume the model is y = f(x) = A * x_1 + B * x_2 + C * x_3 + D * x_4; the gradient on A is dl/dy * dy/dA, and dy/dA = x_1, so the adjustment of A relies mainly on the feature x_1 rather than on the value of A.
Without dropout in your example, mostly only 2 of the 4 neurons take effect. In other words, the model may have less representational power (due to the smaller effective capacity of the model) without dropout. So my understanding from your example is: a model with more capacity should be more stable than a model with less capacity. Is my statement correct?
We don’t apply dropout during inference, so we still have all 4 neurons and keep the original capacity.
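To make that concrete, here is a minimal numpy sketch of inverted dropout (the function name and the 4-unit example are just illustrative, not the book's exact code): at train time some units are zeroed and the survivors rescaled, while at inference the layer is an identity, so all 4 neurons contribute.

```python
import numpy as np

def dropout_layer(h, p_drop, training=True):
    """Inverted dropout: zero units in training, identity at inference."""
    if not training or p_drop == 0.0:
        return h  # inference: all neurons contribute, original capacity
    mask = (np.random.rand(*h.shape) > p_drop).astype(h.dtype)
    return mask * h / (1.0 - p_drop)  # rescale so the expected value matches

h = np.array([1.0, 2.0, 3.0, 4.0])            # 4 hidden activations
print(dropout_layer(h, 0.5, training=True))   # some entries zeroed, rest scaled
print(dropout_layer(h, 0.5, training=False))  # [1. 2. 3. 4.] unchanged
```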
I do not think A & B will be too “lazy” to adjust their weights. Assume the model is y = f(x) = A * x_1 + B * x_2 + C * x_3 + D * x_4; the gradient on A is dl/dy * dy/dA, and dy/dA = x_1, so the adjustment of A relies mainly on the feature x_1 rather than on the value of A.
Your intuition is right! Theoretically it depends on the input features and the activation function. So dropout “forces” neurons A & B to learn if their features are not as effective as the other neurons’ features.
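A quick autograd check of the dy/dA = x_1 point above (a toy sketch; the loss and the numbers are arbitrary):

```python
import torch

x = torch.tensor([0.5, -1.0, 2.0, 3.0])                      # x_1..x_4
w = torch.tensor([0.1, 0.2, 0.3, 0.4], requires_grad=True)   # A, B, C, D
y = (w * x).sum()
loss = (y - 1.0) ** 2        # an arbitrary differentiable loss
loss.backward()

# dl/dA = dl/dy * dy/dA with dy/dA = x_1: each weight's gradient is
# proportional to its own feature, not to the weight's current value.
print(w.grad)                    # tensor([ 0.6500, -1.3000,  2.6000,  3.9000])
print(2 * (y.item() - 1.0) * x)  # same numbers, since dl/dy = 2*(y - 1)
```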
What dropout ultimately does is overcome overfitting and stabilize the training.
Do you have any recommended papers or blogs with theoretical support for how dropout can help stabilize training (e.g., avoid gradient explosion and vanishing)? I find this very interesting to explore.
For any ML problem, we ultimately care about the model’s performance as measured by evaluation metrics (such as accuracy). However, many metrics are not differentiable, so we use loss functions to approximate them.
We use the loss function just for backpropagation during training, and we don’t need to train anymore at test time. Similarly, we only care about the final exam score rather than the concrete answers.
Do I understand it correctly?
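That matches my understanding. A small sketch of the contrast (the logits and labels are made up): the differentiable surrogate is what backpropagation uses, while the metric is a step function we only report.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, -1.0], [0.3, 0.8]], requires_grad=True)
labels = torch.tensor([0, 1])

# Training: a differentiable surrogate (cross-entropy) drives backpropagation.
loss = F.cross_entropy(logits, labels)
loss.backward()

# Evaluation: the metric we actually report (accuracy) is a step function of
# the logits, so it carries no useful gradient -- we only compute it.
accuracy = (logits.argmax(dim=1) == labels).float().mean()
print(loss.item(), accuracy.item())
```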
Is there an error in my implementation? Or is this because of the simple nature of the dataset (e.g., I note that dropout doesn’t help much over the standard implementation in the first place), and would I normally notice some difference? Thanks!
Hi @Nish, great question! It may be hard to observe a large loss/accuracy difference if the network is shallow and converges quickly. As you can see in the original dropout paper (http://jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf) as well, the improvement from dropout on MNIST is less than 1%.
Hi, I’ve noticed that we enable dropout while training and may disable it while doing inference. But do we enable or disable dropout while testing? If I am not mistaken, in our implementation we have allowed dropout at test time. Is that right?
Also, I have another question. We implement dropout just by zeroing some elements in the forward stage, but do the weights corresponding to those zeroed elements still update in the backward stage?
However, there are two main reasons you should not use dropout on test data:
1. Dropout makes neurons output “wrong” values on purpose.
2. Because you disable neurons randomly, your network will produce different outputs for every (sequence of) activations. This undermines consistency, as the sketch below shows.
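Here is a quick PyTorch demonstration of that inconsistency (a toy net; the layer sizes are arbitrary):

```python
import torch
from torch import nn

net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Dropout(0.5), nn.Linear(8, 1))
x = torch.randn(1, 4)

net.train()                           # dropout active: repeated calls disagree
print(net(x).item(), net(x).item())

net.eval()                            # dropout disabled: outputs are identical
print(net(x).item(), net(x).item())
```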
The weights corresponding to those zeros will still update, because each weight corresponds not only to those zeroed units but also to other non-zeroed units in the same hidden layer.
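To be precise about the mechanics: for a single example, the weights attached to a dropped unit do receive zero gradient; they keep updating across a minibatch because each example draws a different mask. A sketch with a hand-picked mask for illustration:

```python
import torch

x = torch.randn(1, 3)                        # one example
W1 = torch.randn(3, 4, requires_grad=True)   # input -> 4 hidden units
W2 = torch.randn(4, 1, requires_grad=True)   # hidden -> output

h = x @ W1
mask = torch.tensor([[1.0, 0.0, 1.0, 1.0]])  # pretend dropout killed unit 1
loss = ((h * mask) @ W2).sum()
loss.backward()

# For THIS example, the weights attached to the dropped unit get zero gradient:
print(W1.grad)   # column 1 is all zeros (weights producing the dropped unit)
print(W2.grad)   # row 1 is zero (weight consuming the dropped unit)
# Across a minibatch, each example draws a different mask, so in expectation
# every weight still receives updates.
```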
Thanks for your answer. But I’m still wondering about something:
How do we disable dropout during testing in this book’s implementation, either from scratch or using the high-level API? I’ve found that in the from-scratch implementation we use the same net during training and testing, whose attribute is_training is set to True. Also, in the high-level API implementation, I didn’t find any difference in the Dropout layer between training and testing.
So could you please tell me where in the code we disable dropout while testing?
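For what it's worth, in PyTorch the built-in nn.Dropout consults the module's training flag internally, and net.eval() sets that flag to False on every submodule (if I remember correctly, the book's evaluation helper switches the net to eval mode before computing accuracy). A minimal sketch of the mechanism (TinyNet is illustrative, not the book's exact code):

```python
import torch
from torch import nn

class TinyNet(nn.Module):
    """Mirrors the from-scratch idea: apply dropout only in training mode."""
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(4, 8)
        self.out = nn.Linear(8, 1)

    def forward(self, x):
        h = torch.relu(self.hidden(x))
        if self.training:            # this flag is flipped by train()/eval()
            h = nn.functional.dropout(h, p=0.5)
        return self.out(h)

net = TinyNet()
net.eval()             # evaluation code calls this before predicting
print(net.training)    # False -> the dropout branch above is skipped
```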
because each weight corresponds not only to those zeroed units but also to other non-zeroed units in the same hidden layer.
Does this mean each weight corresponds to multiple hidden units? I’m not sure about this, but I think each weight corresponds to a unique input (hidden unit) and a unique output in an MLP.
After we apply dropout, the corresponding hidden layers have already changed. “Enable/disable” is confusing wording.
Testing is just the process of computing predictions with the weights obtained after backpropagation. We don’t need to disable dropout!
Yes, matrix multiplication is the reason each weight ends up being tied to multiple hidden units.
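Both statements are compatible, I think. A shape sketch (the layer sizes are arbitrary): each scalar weight does link exactly one unit to one unit, but each hidden unit is touched by a whole row of one matrix and a whole column of the next, which is what the matrix multiplication ties together.

```python
import torch

W1 = torch.randn(4, 3)   # hidden layer: h = W1 @ x, 4 units from 3 inputs
W2 = torch.randn(2, 4)   # output layer: o = W2 @ h, 2 units from 4 hidden

# W1[j, k] connects input k to hidden unit j; W2[i, j] connects hidden j
# to output i. But hidden unit j involves a whole ROW of W1 and a whole
# COLUMN of W2:
j = 1
print(W1[j, :])   # every weight that produces hidden unit j
print(W2[:, j])   # every weight that consumes hidden unit j
```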
More intuitively, we might argue that we encouraged the model to spread out its weights among many features rather than depending too much on a small number of potentially spurious associations.
In the interactive example from Google’s crash course, I’m seeing roughly the same number of features being used when comparing the model trained with and without L2 regularization/weight decay.
This means I couldn’t verify the claim or intuition that weight decay spreads out the model’s weight among many features.
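One toy way to probe the claim outside the crash-course widget (a sketch; the data, sizes, and hyperparameters here are all made up): give the model redundant features, so it can fit either by leaning on one copy or by spreading weight across the copies, and compare how concentrated the learned weights are with and without weight decay. If the intuition holds, decay should pull the max/mean weight ratio toward 1.

```python
import torch
from torch import nn

torch.manual_seed(0)
# Redundant inputs: 4 noisy copies of each of 5 signals.
base = torch.randn(512, 5)
X = base.repeat(1, 4) + 0.01 * torch.randn(512, 20)
y = base.sum(dim=1, keepdim=True)

def weight_concentration(weight_decay):
    net = nn.Linear(20, 1)
    opt = torch.optim.SGD(net.parameters(), lr=0.03, weight_decay=weight_decay)
    for _ in range(1000):
        opt.zero_grad()
        nn.functional.mse_loss(net(X), y).backward()
        opt.step()
    w = net.weight.detach().abs()
    return (w.max() / w.mean()).item()   # near 1 => weight spread evenly

print(weight_concentration(0.0), weight_concentration(0.1))
```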