I would like to ask a question about SGD and mini-batch SGD as explained in the book. Chapter 11.4 states that in SGD each iteration uses only one example. However, when implementing mini-batch SGD (which is the same procedure but with batches of size k instead of 1), we often train the model for several epochs. Does that mean that SGD (with batch size = 1) should also iterate over the whole dataset within one epoch, making n updates, where n is the number of training examples? Wouldn't that be the same as standard batch gradient descent?
@deep_user Regardless of the batch size (1, k > 1, or even k = n), it is recommended to run more than one round of updates, i.e. several epochs. There may be cases where one or two epochs suffice, for example when the dataset is large and contains many similar examples.
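A minimal sketch may make this concrete (using hypothetical toy data and a plain NumPy linear-regression loss, not the book's code): the same training loop covers SGD, mini-batch SGD, and full-batch GD just by changing `batch_size`, and in every case we typically loop over the data for several epochs.

```python
import numpy as np

# Hypothetical toy data: linear regression with n = 1000 examples.
rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def train(batch_size, lr=0.05, num_epochs=5):
    """Same loop for SGD (batch_size=1), mini-batch SGD (1 < k < n),
    and full-batch GD (batch_size=n)."""
    w = np.zeros(d)
    for epoch in range(num_epochs):                 # several passes over the data
        idx = rng.permutation(n)                    # reshuffle each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            grad = 2 * Xb.T @ (Xb @ w - yb) / len(batch)  # gradient estimate on the batch
            w -= lr * grad                          # one parameter update per batch
    return w

w_sgd   = train(batch_size=1)    # n updates per epoch
w_mini  = train(batch_size=32)   # ~n/32 updates per epoch
w_batch = train(batch_size=n)    # 1 update per epoch
```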
Wouldn't that be the same as standard batch gradient descent?
No, because the stochastic gradient is only an estimate of the true gradient: it is computed from a single example (or a small batch) rather than from the full dataset, so each update is cheap but noisy, and one pass over the data gives you n updates instead of a single full-batch update.
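To illustrate (again with hypothetical toy data, not anything from the book), the gradient computed on one random example is a noisy but unbiased estimate of the full gradient: averaged over many draws it comes close to the full-batch gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
w = np.zeros(d)

# Full-batch gradient of the mean squared loss at w.
full_grad = 2 * X.T @ (X @ w - y) / n

# Single-example gradients: each one is noisy ...
single_grads = np.stack([
    2 * X[i] * (X[i] @ w - y[i]) for i in rng.integers(0, n, size=10000)
])
# ... but unbiased, so their average is close to the full gradient.
print("full gradient:     ", np.round(full_grad, 3))
print("mean of estimates: ", np.round(single_grads.mean(axis=0), 3))
```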
@sanjaradylov Okay, I was confused by how it is framed in Chapter 11.4, because there they do not iterate over the whole dataset, only over 50 examples. But now I understand the point: even if you need to run several epochs, you get to update the weights with an estimate of the gradient much more often, and hence need fewer epochs than standard GD.
Thank you very much!