Concise Implementation for Multiple GPUs

astonzhang · November 11, 2020, 6:24pm

https://d2l.ai/chapter_computational-performance/multiple-gpus-concise.html

shaoming_xu · February 7, 2021, 4:55pm

X, y = X.to(devices[0]), y.to(devices[0])
In the code, we send the data to device 0. In that case, how is each batch split across the different GPUs for data parallelism?

anirudh · February 14, 2021, 6:33am

In the PyTorch concise implementation we use nn.DataParallel, so moving X and y to devices[0] is only the starting point and the wrapper does the splitting.
It splits the input across the specified devices by chunking along the batch dimension (other objects are copied once per device). In the forward pass, the module is replicated on each device, and each replica handles a portion of the input. During the backward pass, gradients from each replica are summed into the original module.
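To make the split-then-sum idea concrete, here is a minimal framework-free sketch. It is not how nn.DataParallel is implemented internally. The helpers `split_batch` and `grad_w`, the toy model `y_hat = w * x`, and the two "replicas" are all hypothetical stand-ins chosen for illustration. It only shows that chunking a batch, computing a gradient per chunk with the same weight, and summing the results gives the same gradient as processing the whole batch at once, assuming a loss that is summed over examples.

```python
# Toy model: y_hat = w * x, loss = sum over the batch of (y_hat - y)^2.

def split_batch(batch, num_replicas):
    """Split a list into contiguous chunks along the batch dimension."""
    chunk = -(-len(batch) // num_replicas)  # ceiling division
    return [batch[i:i + chunk] for i in range(0, len(batch), chunk)]

def grad_w(w, xs, ys):
    """Gradient of sum((w * x - y)^2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys))

w = 0.5
X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
num_replicas = 2  # hypothetical stand-in for two GPUs

# "Scatter": each replica receives one chunk of the batch.
x_chunks = split_batch(X, num_replicas)
y_chunks = split_batch(y, num_replicas)

# Each replica uses the same weight w and computes a gradient on its chunk.
replica_grads = [grad_w(w, xc, yc) for xc, yc in zip(x_chunks, y_chunks)]

# "Reduce": per-replica gradients are summed into one gradient.
total_grad = sum(replica_grads)

# Reference: gradient computed on the full batch in one go.
full_grad = grad_w(w, X, y)

print("chunks:", x_chunks)
print("per-replica grads:", replica_grads)
print("summed:", total_grad, "full batch:", full_grad)
```

The equality holds exactly here because the loss is a plain sum over examples. With a mean-reduced loss the per-chunk gradients would need to be weighted by chunk size to match.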

Raymond_Sting · June 17, 2021, 7:41pm

In the PyTorch implementation, if multiple devices (GPUs) are used, the loss l will be an array of losses (one per device), so l.backward() will raise an error. You need to call l.sum().backward() or l.mean().backward() instead. Am I missing something?
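On the reduction itself: as far as I know, backward() without arguments expects a scalar, so any loss that comes out as a vector (for example, per-example losses from a loss created with reduction='none') has to be reduced first. The two reductions are not interchangeable, though. A small framework-free sketch, assuming a toy model `y_hat = w * x` with squared error and a hypothetical helper `per_example_grads`, shows that the gradient of the summed loss is the batch size times the gradient of the mean loss:

```python
def per_example_grads(w, xs, ys):
    """Gradient of each example's (w * x - y)^2 with respect to w."""
    return [2 * (w * x - y) * x for x, y in zip(xs, ys)]

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

grads = per_example_grads(w, xs, ys)

# Analogue of l.sum().backward(): gradients of all examples are added up.
grad_of_sum = sum(grads)

# Analogue of l.mean().backward(): the same total divided by the batch size.
grad_of_mean = sum(grads) / len(grads)

print("sum reduction:", grad_of_sum)
print("mean reduction:", grad_of_mean)
print("ratio:", grad_of_sum / grad_of_mean)
```

So switching from mean to sum scales the update by the batch size, which usually means the learning rate has to be adjusted to compensate.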