The size of the update step is determined by the learning rate lr. Because our loss is calculated as a sum over the minibatch of examples, we normalize our step size by the batch size (batch_size), so that the magnitude of a typical step size does not depend heavily on our choice of the batch size.
I didn’t get this, can someone explain in simpler words?
I hope my words are simpler. From my understanding of the passage, in the weight update equation (w := w - lr * D, where D is the gradient), after each training step on a minibatch (let’s say m examples per minibatch) we divide the total minibatch gradient by the minibatch size (which is m, so D = minibatch_grad / m) and then multiply by the learning rate. As a result, the step size towards the minimum depends mainly on lr rather than on m.
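Here is a minimal sketch of that idea in plain Python, assuming a one-parameter linear model with squared loss. The function names and the toy data are made up for illustration and are not taken from the book. The batch is grown by repeating the same examples, so only m changes between runs.

```python
def summed_gradient(w, xs, ys):
    """Gradient w.r.t. w of the summed loss sum_i 0.5 * (w * x_i - y_i) ** 2."""
    return sum((w * x - y) * x for x, y in zip(xs, ys))


def sgd_step(w, xs, ys, lr):
    """One minibatch SGD update: w := w - lr * (summed gradient / batch size)."""
    batch_size = len(xs)
    return w - lr * summed_gradient(w, xs, ys) / batch_size


# Toy data generated by y = 2 * x (hypothetical, for illustration only).
base_xs = [1.0, 2.0, 3.0, 4.0]
base_ys = [2.0 * x for x in base_xs]

w, lr = 0.0, 0.01
for repeats in (1, 10, 100):
    # Repeating the same examples grows the batch without changing its statistics.
    xs, ys = base_xs * repeats, base_ys * repeats
    unnormalized = lr * summed_gradient(w, xs, ys)  # step without dividing by m
    normalized = w - sgd_step(w, xs, ys, lr)        # step after dividing by m
    print(f"m={len(xs):3d}  lr*sum_grad={unnormalized:8.2f}  lr*sum_grad/m={normalized:.4f}")
```

The unnormalized column grows in proportion to m, while the normalized column stays the same. With real, non-repeated data the normalized step would fluctuate a little from batch to batch, but it would not grow with m.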
I agree with you. Instead of using w := w - D with the summed gradient, which depends heavily on m, we divide by m and introduce lr to control the step size, so the update is much less influenced by m.
From my understanding, we divide the total loss by the batch size to get the average loss for a given batch. Unlike the total loss, the average loss does not grow with the batch size.
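Averaging the loss and dividing the summed gradient by the batch size are two views of the same thing, because differentiation is linear. A small numerical check, again assuming a one-parameter squared-loss model with made-up helper names and toy data:

```python
def total_loss(w, xs, ys):
    """Summed squared-error loss over the minibatch."""
    return sum(0.5 * (w * x - y) ** 2 for x, y in zip(xs, ys))


def mean_loss(w, xs, ys):
    """Average loss: the total divided by the batch size."""
    return total_loss(w, xs, ys) / len(xs)


def numerical_grad(f, w, eps=1e-6):
    """Central finite-difference estimate of df/dw."""
    return (f(w + eps) - f(w - eps)) / (2 * eps)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x for x in xs]  # toy targets, hypothetical
w = 0.5

grad_of_total = numerical_grad(lambda v: total_loss(v, xs, ys), w)
grad_of_mean = numerical_grad(lambda v: mean_loss(v, xs, ys), w)

# Gradient of the mean loss equals the gradient of the total loss divided by m.
print(grad_of_total / len(xs), grad_of_mean)
```

The two printed numbers agree up to floating-point error, so it does not matter whether the division by the batch size happens on the loss or on the gradient.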