Training on Multiple GPUs

  1. Is there a reason why the loss is initialized with reduction='none' and the reduction is then done by hand in the train_batch() function, by calling loss(...).sum()? Wouldn't it be the same to just initialize the loss with reduction='sum'? Maybe I am missing something here.
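For what it's worth, a quick check (with made-up logits and labels, not the chapter's data) suggests the two are numerically equivalent:

```python
import torch
from torch import nn

x = torch.randn(4, 3)           # logits for 4 examples, 3 classes
y = torch.tensor([0, 2, 1, 0])  # class labels

loss_none = nn.CrossEntropyLoss(reduction='none')
loss_sum = nn.CrossEntropyLoss(reduction='sum')

# Summing the per-example losses matches reduction='sum'.
print(torch.allclose(loss_none(x, y).sum(), loss_sum(x, y)))  # True
```

One possible reason to keep reduction='none' is that the per-example losses stay available for inspection or weighting before the sum, but that is speculation on my part.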

  2. Shouldn't torch.cuda.synchronize() also be called inside the train_batch() function, right before calling allreduce()? That would ensure that all the GPUs have finished their backpropagation; otherwise we are not guaranteed that all gradients on all GPUs are already updated at the time of the reduction. (The same can be said about the MXNet version with npx.waitall().)
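To make the question concrete, here is a minimal sketch of the kind of allreduce helper I mean (assuming the d2l-style implementation: sum all shards into the first device, then broadcast back), demonstrated on CPU tensors since I cannot run multiple GPUs here. My understanding is that CUDA operations issued on the same stream execute in issue order, which may be why no explicit synchronization is needed, but I am not sure that holds across devices:

```python
import torch

def allreduce(data):
    # Sum every shard into data[0], then copy the total back to each shard.
    # On GPUs these ops are enqueued on the current stream and execute in
    # issue order, after the previously enqueued backward kernels.
    for i in range(1, len(data)):
        data[0][:] += data[i].to(data[0].device)
    for i in range(1, len(data)):
        data[i][:] = data[0].to(data[i].device)

# Illustration with three fake gradient shards (values chosen arbitrarily).
grads = [torch.ones(3) * (i + 1) for i in range(3)]  # [1s, 2s, 3s]
allreduce(grads)
print([g.tolist() for g in grads])  # every shard now holds [6.0, 6.0, 6.0]
```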