Stochastic Gradient Descent

https://d2l.ai/chapter_optimization/sgd.html

Hi, I wonder whether the expectation operator in the term E[R[w_t]] in equations (11.4.12) and (11.4.13) is unnecessary. Also, the “E” in (11.4.15) and (11.4.16) looks like it should be “R”. Thanks a lot.

In inequality 11.4.12, I guess we imply that

E_{w_t}[l(x_t, w_t)] >= E_{w_t}[E_{x_t}[l(x_t, w_t)]] = E_{w_t}[R(w_t)]

If this is the case, I would appreciate a more thorough explanation.
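
One way to spell out the step above, offered as a sketch rather than the book's own wording: assume x_t is drawn independently of w_t (w_t depends only on the earlier samples x_1, ..., x_{t-1}) and define R(w) := E_x[l(x, w)]. Under those assumptions the tower property gives the relation with equality, which is consistent with the >= written above:

```latex
% Assumptions (not taken from the book's text):
%   R(w) := E_x[ l(x, w) ], and x_t is independent of w_t.
\begin{aligned}
E[\, l(x_t, w_t) \,]
  &= E_{w_t}\big[\, E_{x_t}[\, l(x_t, w_t) \mid w_t \,] \,\big]
     && \text{(tower property)} \\
  &= E_{w_t}[\, R(w_t) \,]
     && \text{(independence of } x_t \text{ and } w_t\text{)}
\end{aligned}
```

The outer expectation stays because w_t is itself random: it is a function of the samples drawn so far.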


In (11.4.15) and (11.4.16), it should be E[R(\bar{wt})] instead of E[\bar{wt}]. After all, we seek an upper bound on how far the expected risk deviates from the minimum risk, which is what we obtain in (11.4.16).
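
To make the role of the averaged iterate concrete, here is a hedged sketch of the Jensen step that would justify bounding E[R(\bar{w})]. It assumes R is convex, the learning rates \eta_t are nonnegative, deterministic, and not all zero, and \bar{w} denotes the \eta-weighted average of the iterates (written \bar{wt} above). None of these assumptions are quoted from the book.

```latex
% Assumptions (illustrative): R convex, \eta_t \ge 0 deterministic, \sum_t \eta_t > 0,
%   \bar{w} := \frac{\sum_{t=1}^{T} \eta_t w_t}{\sum_{t=1}^{T} \eta_t}.
\begin{aligned}
R(\bar{w})
  &\le \frac{\sum_{t=1}^{T} \eta_t \, R(w_t)}{\sum_{t=1}^{T} \eta_t}
     && \text{(Jensen's inequality, } R \text{ convex)} \\
E[\, R(\bar{w}) \,]
  &\le \frac{\sum_{t=1}^{T} \eta_t \, E[\, R(w_t) \,]}{\sum_{t=1}^{T} \eta_t}
     && \text{(linearity and monotonicity of } E\text{)}
\end{aligned}
```

Since R maps a parameter vector to a scalar risk, the quantity being bounded is E[R(\bar{w})], not E[\bar{w}], which would be a vector.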

@mli @goldpiggy

Hi @wwwu and @sanjaradylov, thanks for the discussions. We’ve just revised the proof and it can be previewed at http://preview.d2l.ai.s3-website-us-west-2.amazonaws.com/d2l-en/master/chapter_optimization/sgd.html

Just let me know if you have any further questions on it.