The size of the update step is determined by the learning rate lr. Because our loss is calculated as a sum over the minibatch of examples, we normalize our step size by the batch size (batch_size), so that the magnitude of a typical step size does not depend heavily on our choice of the batch size.
I didn’t get this, can someone explain in simpler words?
I hope my words are simpler. From my understanding of the passage, the weight update is w := w - lr * D, where D is the gradient. After each training step on a minibatch (say m examples per minibatch), we divide the summed minibatch gradient by the minibatch size (so D = minibatch_grad / m) and then multiply by the learning rate. As a result, the size of our step toward the minimum depends mainly on lr rather than on m.
I agree with you. If we used the summed gradient directly (w := w - lr * minibatch_grad), the step would depend heavily on m. Dividing by m gives the averaged gradient D, which is much less influenced by m, so lr alone sets the scale of the step.
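To make this concrete, here is a minimal sketch in plain Python. It assumes a toy one-parameter linear model with squared loss; the helper names summed_gradient and sgd_step are made up for illustration and are not the book's code. It compares the raw step (lr times the summed gradient) with the normalized step (the same thing divided by the batch size) for a few batch sizes.

```python
import random

def summed_gradient(w, xs, ys):
    """Gradient w.r.t. w of the summed loss 0.5 * (w*x - y)**2 over a minibatch."""
    return sum((w * x - y) * x for x, y in zip(xs, ys))

def sgd_step(w, xs, ys, lr):
    """One update: w := w - lr * (summed gradient / batch size)."""
    batch_size = len(xs)
    return w - lr * summed_gradient(w, xs, ys) / batch_size

if __name__ == "__main__":
    random.seed(0)
    true_w, w, lr = 2.0, 0.0, 0.1  # assumed toy values
    for m in (10, 100, 1000):
        xs = [random.gauss(0, 1) for _ in range(m)]
        ys = [true_w * x for x in xs]
        raw_step = abs(lr * summed_gradient(w, xs, ys))    # grows roughly with m
        normalized_step = abs(sgd_step(w, xs, ys, lr) - w)  # stays on a similar scale
        print(m, raw_step, normalized_step)
```

If you run it, the raw step should grow roughly in proportion to m, while the normalized step stays on about the same scale. That is why a single lr can be reused when the batch size changes.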