In exercise 2 of 4.6.7, I increased the number of epochs to 100 with dropout1 and dropout2 set to 0.2 and 0.5, and ran the experiment several times. Every run showed a stretch during training where both train and test accuracy dropped. As far as I know, training for more epochs with otherwise appropriate hyperparameters should make the result better, not worse. See the results below:



Is there any theory that explains this result? (I saw another student report the same problem using PyTorch.)

  1. What happens if you change the dropout probabilities for the first and second layers? In particular, what happens if you switch the ones for both layers? Design an experiment to answer these questions, describe your results quantitatively, and summarize the qualitative takeaways.
    Below are the results when the dropout rates are switched (in tabular format for clarity).
  1. Using the model in this section as an example, compare the effects of using dropout and weight decay. What happens when dropout and weight decay are used at the same time? Are the results additive? Are there diminished returns (or worse)? Do they cancel each other out?

When weight decay is used along with dropout, the effect is at times additive and at times diminishing, as seen from the table below.
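
For anyone who wants to repeat these comparisons, here is a minimal sketch of how the experiment could be organized. It is not the code that produced the numbers in this thread. The names `build_configs`, `run_experiment`, `additivity_gap`, and `run_trial` are hypothetical, the weight decay value `1e-3` is only a placeholder, and `run_trial` stands for whatever training loop you already have (it should train the model from this section once and return the final test accuracy).

```python
import itertools
import statistics


def build_configs(p1=0.2, p2=0.5, weight_decays=(0.0, 1e-3)):
    """Return the (dropout1, dropout2, weight_decay) settings to compare.

    The weight decay value 1e-3 is an assumed placeholder, not a value
    taken from this thread.
    """
    dropout_pairs = [
        (0.0, 0.0),  # no dropout, used as the baseline
        (p1, p2),    # original order
        (p2, p1),    # switched order
    ]
    return [
        (d1, d2, wd)
        for (d1, d2), wd in itertools.product(dropout_pairs, weight_decays)
    ]


def run_experiment(run_trial, configs, seeds=(0, 1, 2)):
    """Run every config once per seed and summarize the test accuracy.

    run_trial is a user-supplied (hypothetical) callable with the signature
    run_trial(dropout1, dropout2, weight_decay, seed) -> test accuracy.
    Returns {config: (mean accuracy, standard deviation)}.
    """
    summary = {}
    for cfg in configs:
        accs = [run_trial(*cfg, seed=s) for s in seeds]
        # stdev needs at least two samples, so fall back to 0.0 for one seed
        spread = statistics.stdev(accs) if len(accs) > 1 else 0.0
        summary[cfg] = (statistics.mean(accs), spread)
    return summary


def additivity_gap(acc_base, acc_dropout, acc_wd, acc_both):
    """Gain of the combination minus the sum of the individual gains.

    Close to zero: the effects are roughly additive.
    Negative: diminishing returns (or the two partly cancel).
    Positive: the combination helps more than the sum of its parts.
    """
    gain_both = acc_both - acc_base
    gain_sum = (acc_dropout - acc_base) + (acc_wd - acc_base)
    return gain_both - gain_sum
```

Averaging over several seeds matters here because the differences between the original and switched dropout rates can be smaller than the run-to-run noise. Comparing `additivity_gap` against the standard deviations from `run_experiment` gives a rough way to tell whether "additive" or "diminishing" is a real effect or just variance.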