Training on Multiple GPUs

https://d2l.ai/chapter_computational-performance/multiple-gpus.html

  1. Is there a reason why the loss is initialized with "reduction='none'" and the reduction is then done by hand in the train_batch() function by calling loss(…).sum()? Wouldn't it be the same to just initialize the loss with "reduction='sum'"? Maybe I am missing something here (see the first sketch after this list).

  2. Shouldn't torch.cuda.synchronize() also be called inside the train_batch() function, right before calling allreduce()? That would ensure that all the GPUs have finished their backpropagation; otherwise we are not guaranteed that all gradients on all GPUs are already updated at the time of the reduction. (The same can be said about the MXNet version with npx.waitall().) See the second sketch after this list.
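
For the first question, here is a minimal sketch (with made-up toy tensors, not the chapter's data) comparing the chapter's pattern of reduction='none' followed by a manual .sum() against initializing the loss with reduction='sum'; numerically the two produce the same scalar:

import torch
from torch import nn

# Hypothetical toy inputs, just to compare the two reductions
logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))

# The chapter's approach: per-example losses, summed by hand
loss_none = nn.CrossEntropyLoss(reduction='none')
l_manual = loss_none(logits, labels).sum()

# The alternative the question proposes
loss_sum = nn.CrossEntropyLoss(reduction='sum')
l_builtin = loss_sum(logits, labels)

print(torch.allclose(l_manual, l_builtin))  # True: the scalars match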
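For the second question, below is a minimal, self-contained sketch of the kind of train_batch() structure being discussed. A toy squared-error model and a hand-rolled allreduce() stand in for the chapter's lenet, loss, and d2l.sgd helpers, so the details are assumptions; the commented-out torch.cuda.synchronize() only marks the placement the question proposes, it is not the chapter's code:

import torch

def allreduce(data):
    # Sum the gradient copies onto the first device, then broadcast back
    for i in range(1, len(data)):
        data[0][:] += data[i].to(data[0].device)
    for i in range(1, len(data)):
        data[i][:] = data[0].to(data[i].device)

def train_batch(X_shards, y_shards, device_params, lr):
    # Forward and backward run independently on each device's shard;
    # a toy squared-error loss stands in for the chapter's LeNet + loss
    losses = [((X @ w - y) ** 2).sum()
              for X, y, w in zip(X_shards, y_shards, device_params)]
    for l in losses:
        l.backward()
    # The placement question 2 proposes: wait for every device to finish
    # its backward pass before the gradients are read by allreduce()
    # torch.cuda.synchronize()
    with torch.no_grad():
        allreduce([w.grad for w in device_params])
        for w in device_params:      # SGD step with the summed gradient
            w -= lr * w.grad
            w.grad.zero_()

# Usage on whatever devices are available (falls back to CPU)
devices = ([torch.device(f'cuda:{i}') for i in range(torch.cuda.device_count())]
           or [torch.device('cpu')])
w0 = torch.randn(5, 1)
device_params = [w0.clone().to(d).requires_grad_() for d in devices]
X_shards = [torch.randn(8, 5, device=d) for d in devices]
y_shards = [torch.randn(8, 1, device=d) for d in devices]
train_batch(X_shards, y_shards, device_params, lr=0.01)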

Yeah, exactly. I was also wondering about the 2nd one: why we don't call torch.cuda.synchronize() right before calling allreduce(), and instead call it only after the whole train_batch() function. Thank you for asking these too. Were you able to get an idea of why it was done this way?