Minibatch Stochastic Gradient Descent

https://d2l.ai/chapter_optimization/minibatch-sgd.html

In the print statement of the function train_ch11, I think timer.sum()/num_epochs should be used instead of timer.avg().
If I am right, then it is not true that the time required per epoch for minibatch SGD (with batch_size=100) is shorter than the time needed for batch gradient descent.
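
For what it's worth, here is a minimal standalone sketch (not the book's code; the epoch/interval counts and the sleep are made up) of why the two quantities differ: train_ch11 stops and restarts its d2l.Timer several times within an epoch, so timer.avg() is the mean time per measured interval, not per epoch.

```python
import time
from d2l import torch as d2l  # d2l.Timer provides start/stop/avg/sum

num_epochs, intervals_per_epoch = 2, 5
timer = d2l.Timer()  # starts timing on construction
for _ in range(num_epochs):
    for _ in range(intervals_per_epoch):
        time.sleep(0.01)  # stand-in for processing a chunk of minibatches
        timer.stop()      # record one measurement interval
        timer.start()

# What timer.avg() reports: mean time per measured interval (~0.01 s here)
print(f'{timer.avg():.3f} sec per interval')
# The suggested metric: total time divided by epochs (~0.05 s here)
print(f'{timer.sum() / num_epochs:.3f} sec per epoch')
```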

I think it would be good to mention the linear scaling rule and square root scaling rule in this chapter – or link to a chapter that discusses them. The linear scaling rule is promoted, for example, by Horovod, IIRC.
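
For concreteness, here is a quick sketch of what the two rules prescribe (the base learning rate and batch sizes below are just illustrative values, not from the chapter):

```python
import math

def linear_scaling(base_lr, base_batch_size, batch_size):
    # Linear scaling rule: scale the learning rate by the same factor
    # as the batch size (the heuristic Horovod recommends, IIRC).
    return base_lr * batch_size / base_batch_size

def sqrt_scaling(base_lr, base_batch_size, batch_size):
    # Square root scaling rule: scale the learning rate by the square
    # root of the batch-size ratio instead.
    return base_lr * math.sqrt(batch_size / base_batch_size)

print(linear_scaling(0.1, 32, 256))  # 0.8
print(sqrt_scaling(0.1, 32, 256))    # ~0.28
```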