https://d2l.ai/chapter_computational-performance/multiple-gpus.html
Given that training with k GPUs increases the effective batch size by a factor of k, shouldn't we decrease (rather than increase, as stated) the LR by a factor of k to compensate for the approximately k-times larger weight update per iteration that results from the larger batch?
Why?
I think the LR should increase to keep up with the increased batch size.
Why may large minibatches require a slightly increased learning rate?
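A minimal sketch of the intuition, assuming the k workers average (rather than sum) their local gradients, as data-parallel training typically does. The toy least-squares problem, the names `b`, `k`, `w_true`, and the final learning-rate step are illustrative assumptions, not code from the chapter: the averaged gradient over k minibatches has roughly the same magnitude as a single-minibatch gradient but lower variance, which is why the common linear scaling heuristic (Goyal et al., 2017) multiplies the base LR by k instead of dividing it.

```python
# Illustrative sketch (assumptions: toy least-squares loss, gradients are
# *averaged* across workers, names b / k / w_true are placeholders).
import torch

torch.manual_seed(0)
d, b, k = 10, 32, 4              # feature dim, per-GPU batch size, number of GPUs
w_true = torch.randn(d)          # ground-truth weights for synthetic data
w = torch.zeros(d, requires_grad=True)

def minibatch_grad(batch_size):
    """Gradient of 0.5 * MSE on one freshly sampled minibatch."""
    X = torch.randn(batch_size, d)
    y = X @ w_true + 0.1 * torch.randn(batch_size)
    loss = 0.5 * ((X @ w - y) ** 2).mean()
    (grad,) = torch.autograd.grad(loss, w)
    return grad

# Single GPU: one minibatch of size b.
g_single = minibatch_grad(b)

# k GPUs with gradient averaging: effective batch size k * b.
g_averaged = torch.stack([minibatch_grad(b) for _ in range(k)]).mean(dim=0)

# The averaged gradient is NOT k times larger -- it has roughly the same
# magnitude as the single-GPU gradient, only with less noise.
print(f"single-GPU grad norm: {g_single.norm().item():.3f}")
print(f"k-GPU averaged norm:  {g_averaged.norm().item():.3f}")

# The lower gradient noise is what permits a larger step, hence the linear
# scaling heuristic: multiply the base LR by k (an assumption of this sketch).
base_lr = 0.1
scaled_lr = base_lr * k
print(f"base LR {base_lr} -> scaled LR {scaled_lr}")
```

If the workers summed their gradients instead of averaging them, the update really would be about k times larger, and dividing the LR by k would be the natural correction; the two conventions differ only by where the factor of k is absorbed.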