Training on Multiple GPUs

https://d2l.ai/chapter_computational-performance/multiple-gpus.html

  1. Is there a reason why the loss is initialized with "reduction='none'" and the reduction is then done by hand in the train_batch() function by calling loss(…).sum()? Wouldn't it be the same to just initialize the loss with "reduction='sum'"? Maybe I am missing something here. (A quick check of this is sketched below, after the second question.)

  2. Shouldn't torch.cuda.synchronize() also be called inside the train_batch() function, right before calling allreduce()? That would ensure that all the GPUs have finished their backpropagation; otherwise we are not guaranteed that all gradients on all GPUs are already updated at the time of the reduction. (The same can be said about the MXNet version with npx.waitall().)
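
For reference on the first question, here is a minimal, self-contained check (not the book's code, just an illustration with made-up tensors) that reduction='none' followed by an explicit .sum() gives the same value and the same gradients as reduction='sum':

```python
import torch
from torch import nn

torch.manual_seed(0)
logits = torch.randn(4, 10, requires_grad=True)
labels = torch.randint(0, 10, (4,))

# reduction='none' keeps the per-example losses; summing them by hand ...
l_none = nn.CrossEntropyLoss(reduction='none')(logits, labels).sum()

# ... matches reduction='sum', which sums internally
l_sum = nn.CrossEntropyLoss(reduction='sum')(logits, labels)

print(torch.allclose(l_none, l_sum))  # True

# The gradients agree as well
g_none, = torch.autograd.grad(l_none, logits)
g_sum, = torch.autograd.grad(l_sum, logits)
print(torch.allclose(g_none, g_sum))  # True
```

So as far as the summed loss and its gradients go, the two should be interchangeable.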

Yeah, exactly, I was also wondering about the 2nd one: why we don't call torch.cuda.synchronize() right before calling allreduce(), and why we instead call it after the whole train_batch() function. Thank you for asking these too. Were you able to get an idea of why it was done this way?

About the second question, I think the transfer of data from one GPU to another inside the allreduce() function cannot happen before backward() has finished on the GPUs involved. Since we access the gradients of a given GPU for the transfer, and CUDA operations queued on the same device (on its default stream) execute in order, the copy cannot start until that GPU is done with all the work queued before it, including backward(). Hence, we don't need to explicitly call torch.cuda.synchronize() before allreduce().
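
To make that concrete, here is a sketch of what the allreduce in this section looks like (paraphrased from memory, so treat the exact code as an approximation rather than the book's verbatim implementation). The point is that the cross-device copies are issued on the default streams, so they are ordered after the backward() work that is already queued:

```python
import torch

def allreduce(data):
    """Sum a list of per-GPU tensors and broadcast the result back to all GPUs.

    Sketch of the pattern used in the chapter (paraphrased from memory).
    """
    # Accumulate everything on the first device. Each .to() issues a
    # cross-device copy that runs only after the work already queued on the
    # source GPU's default stream (i.e. after backward()) has produced the
    # data, so no explicit torch.cuda.synchronize() is needed here.
    for i in range(1, len(data)):
        data[0][:] += data[i].to(data[0].device)
    # Broadcast the summed result back to every other device.
    for i in range(1, len(data)):
        data[i][:] = data[0].to(data[i].device)

# Tiny usage example (requires at least 2 CUDA devices):
if torch.cuda.device_count() >= 2:
    data = [torch.ones((1, 2), device=f'cuda:{i}') * (i + 1) for i in range(2)]
    allreduce(data)
    print(data[0], data[1])  # both should now hold [[3., 3.]]
```

If I had to guess, the torch.cuda.synchronize() that is called after train_batch() in the training loop is mainly there so that the per-epoch timing is accurate, rather than for correctness.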